# Applied Data Lab

# Assignment 03: Data Cleaning

## String Operations

String operations are used to manipulate text data in various ways.









### `strip()`

One common operation is removing whitespace at the beginning or end of a string, which can be done using the `.strip()` method. Called without arguments, it trims leading and trailing whitespace (spaces, tabs, and newlines) and returns a new string; the original string is left unchanged.

**Example:**

```python
# Remove leading and trailing spaces from a string
text = "  hello  "
result = text.strip()
print(result)  # Output: "hello"
```

If an argument is passed, it is treated as a set of characters, and any of those characters are removed from the beginning and end of the string.

**Example:**

```python
# Remove the characters passed as the argument from both ends of a string
text = "???text with spaces???"
result = text.strip("?")
print(result)  # Output: "text with spaces"
```
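
One detail worth keeping in mind is that the argument is read as a set of characters, not as a literal prefix or suffix. The short sketch below uses a made-up string to show that any mix of the listed characters is trimmed from both ends, while the same characters inside the string are kept.

```python
# The argument to strip() is a set of characters, not a substring:
# any mix of "?", "!" and "." is removed from both ends only.
text = "?!.hello.world!?"
result = text.strip("?!.")
print(result)  # Output: "hello.world"
```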


In [None]:
# Run Example
#
#

### `rstrip()`

The `rstrip()` method removes trailing (right-hand) characters from a string. You can pass the set of characters to remove as an argument; without an argument, it removes trailing whitespace. Characters are removed from the end of the string until a character outside that set is reached.

**Example:**

```python
# Remove trailing question marks from the end of a string
text = "???text with spaces???"
result = text.rstrip("?")
print(result)  # Output: "???text with spaces"
```

### `lstrip()`

The `lstrip()` method removes leading (left-hand) characters from a string. You can pass the set of characters to remove as an argument; without an argument, it removes leading whitespace. Characters are removed from the beginning of the string until a character outside that set is reached.

**Example:**

```python
# Remove leading question marks from the beginning of a string
text = "???text with spaces???"
result = text.lstrip("?")
print(result)  # Output: "text with spaces???"
```

In [None]:
# Run These Examples
#
#

### `lower()`

You can convert text to lowercase using the `.lower()` method. This operation is especially useful for making text data consistent and easier to work with. It transforms all characters in a string to their lowercase form.

**Example:**

```python
# Convert text to lowercase
text = "Hello World"
lowercase_text = text.lower()
print(lowercase_text)  # Output: "hello world"
```

In [None]:
# Run Example
#
#

### `replace()`

The `.replace()` method of Python strings replaces every occurrence of a specified substring or character with another substring. This can be helpful for cleaning and modifying text data.

**Example:**

```python
# Replace a substring in a string
text = "I like cats, but some people like dogs."
modified_text = text.replace("cats", "hamsters")
print(modified_text)  # Output: "I like hamsters, but some people like dogs."
```

In [None]:
# Run Example
#
#

### Exercise 1: Cleaning Text

**Objective:** Clean and preprocess the given text.

```python
text = "    A warden, a powerful but evadable $hostile mob $, summoned by sculk shriekers in the deep dark biome, attacks by swinging its arms downward. It deals the highest melee damage among all mobs and can also release a sonic boom attack. This attack homes in on the target, pierces obstacles, and only the Resistance effect can reduce its damage. Wardens are blind and rely on vibrations, smell, and touch to detect players and mobs for attack. They can be evaded via sneaking, diversions, and wool.????       "
```

**Instructions:**

1. Apply the `.strip()` method to remove leading and trailing spaces from the text.

2. Apply the `.lower()` method to convert the text to lowercase.

3. Use the `.replace()` method to remove any dollar signs ('$') by replacing them with an empty string ('').

4. Use the `.replace()` method again to remove any commas (',') and periods ('.') by replacing them with spaces.

5. Apply the `.split()` method to convert the cleaned text into a list of words.

6. Convert the list of words into a set to remove duplicate words.

**HINT:** `set(list_of_words)`
A set effectively eliminates duplicate values, retaining only unique elements from the list.

7. Join the unique words back into a single string using `" ".join(set_of_words)`, then print it.
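
The steps above can be sketched on a much shorter, made-up string before applying them to the exercise text. The variable names `sample`, `cleaned`, `words`, and `unique_words` are illustrative only, and because a set is unordered, the word order in the printed result may vary between runs.

```python
# Hypothetical sample text, much shorter than the exercise text
sample = "   The $cat sat, the dog sat.  "

cleaned = sample.strip()              # remove leading/trailing whitespace
cleaned = cleaned.lower()             # convert to lowercase
cleaned = cleaned.replace("$", "")    # drop dollar signs
cleaned = cleaned.replace(",", " ").replace(".", " ")  # punctuation -> spaces

words = cleaned.split()               # split on whitespace into a list of words
unique_words = set(words)             # duplicates removed, order not preserved
print(" ".join(unique_words))
```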


In [28]:
# Do Exercise in this cell
#
#
#
text = "    A warden, a powerful but evadable $hostile mob $, summoned by sculk shriekers in the deep dark biome, attacks by swinging its arms downward. It deals the highest melee damage among all mobs and can also release a sonic boom attack. This attack homes in on the target, pierces obstacles, and only the Resistance effect can reduce its damage. Wardens are blind and rely on vibrations, smell, and touch to detect players and mobs for attack. They can be evaded via sneaking, diversions, and wool.????       "

## Setting Up the Address
In this section, the `PATH` variable is set to the current working directory, which is where the notebook is open. This makes it easy to load the dataset file from this location.

In [42]:
import pandas as pd

In [None]:
# Run this cell
import os
PATH = os.getcwd() + '/'
PATH

**ONLY FOR GOOGLE COLAB USERS**

For those who are using **Google Colab**, uncomment and run the cell below.

**Note**: You have to replace the value of the variable `YOUR_PATH_TO_DATASET_DIRECTORY` with the path of the Google Drive folder where your dataset is placed.



In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')
# YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab/phase_2"
# PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
# PATH

In [35]:
from google.colab import drive
drive.mount('/content/drive/')
YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab_Assignments/phase_2"
PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
PATH

Mounted at /content/drive/


'/content/drive/MyDrive/work/Applied_Data_Lab_Assignments/phase_2/'

### Exercise 2: Read Data

Read `laptops.csv` into a variable named `data`.


In [None]:
# Do Exercise in this cell
#
#
#

In [45]:
#HINT
data = pd.read_csv(PATH+'laptops.csv', encoding="latin-1")

Display the first rows of the data using the `head()` method.

In [107]:
data.head(5)

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,os,os_version,weight_kg,price_euros
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,1339.69
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,898.94
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,575.0
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,2537.45
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,1803.6


### Exercise: Clean Column Names

**Objective:** Clean and simplify column names in a DataFrame.

**Instructions:**

1. Get the column names using `data.columns`.
   - *Hint*: To retrieve the column names, use `col = data.columns`.

2. Clean the column names to make them more concise and meaningful:
   - Remove leading and trailing spaces from the names.
   - Convert the names to lowercase.
   - Replace spaces with underscores.
   - Remove any parentheses from the names.
   - *Hint*: You can use string manipulation methods like `.strip()`, `.lower()`, `.replace()`, and `.str` to apply these changes to the column names. `col = col.str.method()`

3. Replace a specific column name to be more informative:
   - Rename the column named 'operating_system' to 'os'.
   - *Hint*: Use the `.replace()` method to rename the column.

4. Print the cleaned and modified column names.
   - *Hint*: Use the `print()` function to display the cleaned column names.

5. Convert the modified column names to a list.
   - *Hint*: Use the `list()` function to convert the column names from an index to a list.

6. Modify the second-to-last column name to include '_kg' as a suffix.
   - *Hint*: `col[-2] = col[-2] + "_kg"`

7. Set the cleaned column names as the new column names in the DataFrame using `data.columns = col`.
   - *Hint*: Assign the cleaned column names (variable `col`) back to the DataFrame's `.columns` attribute.


In [None]:
# Do Exercise in this cell
#
#
#

In [88]:
col = data.columns
col = col.str.strip()                        # remove leading/trailing spaces
col = col.str.lower()                        # convert to lowercase
col = col.str.replace(" ", "_")              # spaces -> underscores
col = col.str.replace("(", "", regex=False)  # regex=False treats "(" as a literal character
col = col.str.replace(")", "", regex=False)
col = col.str.replace("operating_system", "os")
col = list(col)  # an Index is immutable, so convert to a list before changing one name
col[-2] = col[-2] + "_kg"
print(col)
data.columns = col

['manufacturer', 'model_name', 'category', 'screen_size', 'screen', 'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight_kg', 'price_euros']


First, check the data types of all attributes using the `.info()` method.

In [103]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   manufacturer  1303 non-null   object
 1   model_name    1303 non-null   object
 2   category      1303 non-null   object
 3   screen_size   1303 non-null   object
 4   screen        1303 non-null   object
 5   cpu           1303 non-null   object
 6   ram           1303 non-null   object
 7   storage       1303 non-null   object
 8   gpu           1303 non-null   object
 9   os            1303 non-null   object
 10  os_version    1133 non-null   object
 11  weight_kg     1303 non-null   object
 12  price_euros   1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


Columns such as 'screen_size', 'ram', 'weight_kg', and 'price_euros' currently store numeric values as strings with attached non-numeric characters. Before converting their data types, we should first remove these non-numeric characters from the values.
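
To see why the cleanup has to come first, here is a small sketch on a made-up Series in the same style as the 'ram' column. The Series `ram` and the flag `converted_directly` are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical values in the same style as the 'ram' column
ram = pd.Series(["8GB", "16GB", "4GB"])

try:
    ram.astype(int)          # fails: "8GB" is not a valid integer literal
    converted_directly = True
except ValueError:
    converted_directly = False

# Removing the suffix first makes the conversion possible
ram_clean = ram.str.replace("GB", "", regex=False).astype(int)
print(converted_directly)    # False
print(ram_clean.tolist())    # [8, 16, 4]
```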



### Exercise 3: Cleaning Values of Specific Columns

**Objective:** Clean the values in the 'price_euros', 'screen_size', 'weight_kg', and 'ram' columns.

**Instructions:**

In this exercise, you will perform data cleaning on specific columns within the dataset using Pandas. Follow these steps:

1. **Clean 'price_euros' Column:**
   - Use the `str.replace()` method to replace all commas (`,`) with periods (`.`) in the 'price_euros' column. This will ensure that prices are represented with decimal points instead of commas. Hint: `data["column"] = data["column"].str.replace(.....)`.

2. **Clean 'screen_size' Column:**
   - Use the `str.rstrip()` method to remove the double quotes (`"`) from the right side of values in the 'screen_size' column. This will ensure that screen sizes are represented as numeric values. Hint: `rstrip('"')`.

3. **Clean 'weight_kg' Column:**
   - Use the `str.replace()` method to remove the "kg" suffix from the values in the 'weight_kg' column. This will leave only the numeric weight values.

4. **Clean 'ram' Column:**
   - Use the `str.replace()` method to remove the "GB" suffix from the values in the 'ram' column. This will leave only the numeric values.

In [None]:
# Do Exercise in this cell
#
#
#

In [111]:
data["price_euros"] = data["price_euros"].str.replace(",",".")
data["screen_size"] = data["screen_size"].str.rstrip('"')
data["weight_kg"] = data["weight_kg"].str.replace('kg',"")
data["ram"] = data["ram"].str.replace('GB',"")

### Exercise 4: Changing Data Types

**Objective:** Convert the `price_euros`, `screen_size`, and `ram` columns of the dataset to appropriate data types.

**Instructions:**


1. To change the data types of specific columns:
   - Use the `.astype(float)` method on the "price_euros" column to convert it to the float data type.
   - Use the `.astype(float)` method on the "screen_size" column to convert it to the float data type.
   - Use the `.astype(int)` method on the "ram" column to convert it to the integer data type.
   - Hint: `data["column"] = data["column"].astype(int)`.

2. Ensure that you overwrite the existing columns in the DataFrame with the new data types.

3. Display the DataFrame to check if the data type changes have been applied correctly.

In [134]:
data["price_euros"] = data["price_euros"].astype(float)
data["screen_size"] = data["screen_size"].astype(float)
data["ram"] = data["ram"].astype(int)

Converting `weight_kg` to float raises an error because one row contains an abnormal value with a stray `s` in it. The cell below already handles this for you: it uses `data[data["weight_kg"].apply(lambda x: x.find("s") != -1)]` to locate the row, corrects its value, and then converts the column to float.

In [None]:
display ( data[data["weight_kg"].apply(lambda x:  x.find("s") != -1)] )
data.loc[1061, "weight_kg"] = '4'
data["weight_kg"] = data["weight_kg"].astype(float)
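
A more general way to locate values that cannot be converted, which is not what the cell above does, is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a made-up Series named `weights`; it assumes the column holds strings, as `weight_kg` does at this point.

```python
import pandas as pd

# Hypothetical weights; the last value carries a stray "s"
weights = pd.Series(["1.37", "2.2", "4s"])

# errors="coerce" turns unparseable values into NaN instead of raising
parsed = pd.to_numeric(weights, errors="coerce")

# Rows that failed to parse are the ones that became NaN
bad_rows = weights[parsed.isna()]
print(bad_rows.tolist())  # ['4s']
```

Unlike searching for one specific letter, this approach flags any value that is not a valid number.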

In [135]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  1303 non-null   object 
 1   model_name    1303 non-null   object 
 2   category      1303 non-null   object 
 3   screen_size   1303 non-null   float64
 4   screen        1303 non-null   object 
 5   cpu           1303 non-null   object 
 6   ram           1303 non-null   int64  
 7   storage       1303 non-null   object 
 8   gpu           1303 non-null   object 
 9   os            1303 non-null   object 
 10  os_version    1133 non-null   object 
 11  weight_kg     1303 non-null   float64
 12  price_euros   1303 non-null   float64
dtypes: float64(3), int64(1), object(9)
memory usage: 132.5+ KB


### Exercise 4: Normalize Frequency Result

**Objective:** Normalize the frequency values of sectors in the DataFrame and display the normalized results.

**Instructions:**

1. Use `value_counts()` method to calculate the frequency of sector values from the 'Sector' column and save the result in a variable called `sector_freq`.

2. Divide the `sector_freq` Series by the total number of rows in the DataFrame (`len(data)`).

**Hint**: `sector_freq / len(data)`

3. Print the normalized frequency values.

By normalizing the frequency values, you'll get a sense of the proportion of each sector within the dataset.
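
As a rough illustration of the idea, the sketch below applies the same two steps to a tiny made-up DataFrame called `toy`. The sector names are invented and only stand in for the real column.

```python
import pandas as pd

# Hypothetical stand-in for the 'Sector' column
toy = pd.DataFrame({"Sector": ["Tech", "Energy", "Tech", "Finance"]})

sector_freq = toy["Sector"].value_counts()   # raw counts per sector
proportions = sector_freq / len(toy)         # counts divided by the number of rows
print(proportions)
```

Here "Tech" appears in 2 of 4 rows, so its proportion is 0.5, and all proportions sum to 1.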

In [None]:
# Do Exercise in this cell
#
#
#

### Exercise 5: Normalize Frequency Result Using the 'normalize' Argument

**Objective:** Normalize the frequency values of sectors in the DataFrame using the `normalize` argument in the `value_counts()` method and display the normalized results.

**Instructions:**

1. Use Pandas' `value_counts()` method on the 'Sector' column of the DataFrame. Pass the argument `normalize=True` to the method so that it calculates and returns the normalized frequency values directly. Save the result in a variable named `sector_freq_normalized`.

**Hint**: `value_counts(normalize=True)`

2. Print the `sector_freq_normalized` Series to display the normalized frequency values.

Using the `normalize=True` argument simplifies the process of obtaining normalized frequency values from the 'Sector' column.
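
The following sketch, on a made-up DataFrame `toy`, checks that `normalize=True` matches dividing the counts by the number of rows. This holds when the column has no missing values; `value_counts()` skips missing values by default, so the two results can differ otherwise.

```python
import pandas as pd

# Hypothetical stand-in for the 'Sector' column
toy = pd.DataFrame({"Sector": ["Tech", "Energy", "Tech", "Finance"]})

manual = toy["Sector"].value_counts() / len(toy)
built_in = toy["Sector"].value_counts(normalize=True)

# Sort both by label so the comparison is made sector by sector
same = (manual.sort_index() == built_in.sort_index()).all()
print(same)  # True
```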

In [None]:
# Do Exercise in this cell
#
#
#

## Selecting Rows Using Boolean Mask/Indexing

You can select specific rows from a DataFrame using a boolean mask or indexing. A boolean mask is essentially a series with the same length as the DataFrame, containing `True` and `False` values. When you apply this mask to the DataFrame, it selects rows where the mask is `True`.

Here's an example to illustrate how this works:

```python
# Import Pandas
import pandas as pd

# Sample DataFrame
person = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 35]}
df = pd.DataFrame(person)

# Create a boolean mask by hand: True keeps a row, False drops it
mask = [True, False, False, True]

# Apply the mask to the DataFrame
selected_rows = df[mask]

print(selected_rows)
```

In this example, we've created a boolean mask (`mask`) where `True` values indicate the rows to select (in this case, the first and last rows). When the mask is applied to the DataFrame, it filters the rows accordingly.

In [None]:
# Run this cell
person = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
          'Age': [25, 30, 22, 35]}
df = pd.DataFrame(person)
print("Data")
display(df)

mask = [True, False, False, True]
selected_rows = df[mask]
print("\n\nSelected Data")
display(selected_rows)

## Comparison Operators

Comparison Operators in Pandas are used to filter rows in a DataFrame based on certain conditions. These operators are similar to the ones used in Python. Here's a summary of common comparison operators and their descriptions:

| Operator | Description                  |
|----------|------------------------------|
| `<`      | Less than                    |
| `>`      | Greater than                 |
| `<=`     | Less than or equal to        |
| `>=`     | Greater than or equal to     |
| `!=`     | Not equal to                 |
| `==`     | Equal to                     |

You can use these operators to create conditions and filter rows in a DataFrame accordingly. For example, you can filter rows where a particular column is greater than a specific value or where two columns are not equal.
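
As a small sketch of the second case, comparing two columns with each other, consider the made-up DataFrame `scores` below. Its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical scores, used only to illustrate column-to-column comparison
scores = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                       "Test1": [80, 65, 90],
                       "Test2": [80, 70, 85]})

# Keep the rows where the two scores differ
changed = scores[scores["Test1"] != scores["Test2"]]
print(changed["Name"].tolist())  # ['Bob', 'Charlie']
```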



In [None]:
# Run this cell
df['Age'] > 25

This returns a boolean mask for records where the age is greater than 25. You can use this mask and pass it to the DataFrame (`df`) to retrieve the corresponding records.




In [None]:
# Run this cell
mask = df['Age'] > 25
df[mask]

Alternatively, the condition can be written directly inside the brackets:

In [None]:
# Run this cell
df[ df['Age'] > 25 ]

### Exercise 6: Selecting Records with Revenues Greater Than Median

**Objective:** Calculate the median of revenues and select records with revenues greater than the median.

**Instructions:**

1. Calculate the median of the 'revenues' column using the `median()` method and save it in a variable named `median_revenues`.

**Hint:** `data['col'].median()`

2. Apply a filter to the DataFrame to select records with revenues greater than the median, and save the result in a variable named `selected_records`.

**Hint:** `data[ data['col'] > x ]`

3. Print the `selected_records` DataFrame to display the records with revenues greater than the median.
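
The same pattern can be tried first on a tiny made-up DataFrame. The frame `toy` and its values are hypothetical; the real dataset and its column names may differ.

```python
import pandas as pd

# Hypothetical data; the real dataset and its column names may differ
toy = pd.DataFrame({"company": ["A", "B", "C", "D", "E"],
                    "revenues": [10, 40, 20, 50, 30]})

median_revenues = toy["revenues"].median()                  # 30.0
selected_records = toy[toy["revenues"] > median_revenues]   # strictly greater
print(selected_records["company"].tolist())                 # ['B', 'D']
```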

In [None]:
# Do Exercise in this cell
#
#
#

### Exercise 7: Selecting Records with Revenues Less Than or Equal to Mean

**Objective:** Calculate the mean of revenues and select records with revenues less than or equal to the mean.

**Instructions:**

1. Calculate the mean of the 'revenues' column using the `mean()` method and save it in a variable named `mean_revenues`.

2. Apply a filter to the DataFrame to select records with revenues less than or equal to the mean, and save the result in a variable named `selected_records`.

3. Print the `selected_records` DataFrame to display the records with revenues less than or equal to the mean.

In [None]:
# Do Exercise in this cell
#
#
#

## Logical Operators in Pandas

Pandas also supports logical operators that allow you to create complex conditions for filtering rows in a DataFrame. Here are the common logical operators and their descriptions:

- `&` (AND): Combines two or more conditions and returns `True` if all conditions are `True`.
- `|` (OR): Combines two or more conditions and returns `True` if at least one condition is `True`.
- `~` (NOT): Negates a condition, returning `True` if the condition is `False`.

You can use these logical operators to create compound conditions for row selection in a DataFrame. For example, you can filter rows where a specific column meets multiple conditions, or you can exclude rows based on a particular condition.

Here's an example of using logical operators in Pandas:

In [None]:
# Run this cell

person = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 35],
        'Salary': [50000, 60000, 45000, 70000]}
df = pd.DataFrame(person)

# Create a condition to filter rows where 'Age' is greater than 25 and 'Salary' is less than 70000
condition = (df['Age'] > 25) & (df['Salary'] < 70000)

# Apply the condition to select rows
selected_rows = df[condition]

print(selected_rows)

In this example, the logical operator `&` is used to create a compound condition, and the resulting DataFrame `selected_rows` contains the rows that satisfy both conditions.

Using logical operators allows you to perform advanced row filtering based on multiple criteria in your DataFrame.
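
The other two operators work the same way. The sketch below reuses the same sample data to show `|` and `~`; the names `either` and `not_older` are illustrative. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

person = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
          'Age': [25, 30, 22, 35],
          'Salary': [50000, 60000, 45000, 70000]}
df = pd.DataFrame(person)

# OR: younger than 25, or earning more than 60000
either = df[(df['Age'] < 25) | (df['Salary'] > 60000)]
print(either['Name'].tolist())     # ['Charlie', 'David']

# NOT: everyone who is not older than 25
not_older = df[~(df['Age'] > 25)]
print(not_older['Name'].tolist())  # ['Alice', 'Charlie']
```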

### Exercise 8: Select Records Based on Two Constraints

**Objective:** Select records from the DataFrame where the headquarters location is `'Beijing, China'` and revenues are greater than the mean of revenues.

**Instructions:**

1. Calculate the mean of revenues using the `mean()` method on the 'Revenues' column. Store the result in a variable named `mean_revenues`.

2. Create a condition to filter rows where 'HQ_Location' is equal to `'Beijing, China'` and 'Revenues' are greater than `mean_revenues`. Use the logical AND operator (`&`) to combine these conditions, wrapping each one in parentheses.

3. Apply the condition to select rows that meet both constraints and store the result in a new DataFrame called `selected_records`.

4. Print the `selected_records` DataFrame to display the records that satisfy both constraints.

In [None]:
# Do Exercise in this cell
#
#
#

### Exercise 9: Sort Data by Single Column

**Objective:** Sort the data by the 'Total_Stockholder_Equity' column in ascending order.

**Instructions:**

1. Use the `sort_values()` method on the DataFrame to sort the data by the 'Total_Stockholder_Equity' column in ascending order (`ascending=True`).

**Hint:** `data.sort_values(by='column_name', ascending=True/False)`

2. Save the sorted DataFrame in a variable named `sorted_data`.

3. Print the `sorted_data` DataFrame to display the data sorted by total stockholder equity in ascending order.
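
A minimal sketch of `sort_values()` on a made-up DataFrame is shown below. The frame `toy` and its `equity` column are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical data standing in for the real dataset
toy = pd.DataFrame({"company": ["A", "B", "C"],
                    "equity": [300, 100, 200]})

# sort_values returns a new, sorted DataFrame; toy itself is not modified
sorted_data = toy.sort_values(by="equity", ascending=True)
print(sorted_data["company"].tolist())  # ['B', 'C', 'A']
```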

In [None]:
# Do Exercise in this cell
#
#
#

### Exercise 10: Select Records Having Maximum Total Stockholder Equity for Each Country

**Objective:** Select records for each country where the total stockholder equity is maximum within that country.

**Instructions:**

In this exercise, you will use the `groupby` method in pandas to group the data by the 'country' column and then filter records where the total stockholder equity is maximum within each country. Here are the steps:

1. **Group Data by Country:**
   - Use the `groupby` method to group the data by the 'country' column.
   - Apply the `['total_stockholder_equity']` indexer to select the 'total_stockholder_equity' column for each group.

   ```python
   max_group_equity = data.groupby('country')['total_stockholder_equity']
   ```

2. **Find the Maximum Equity in Each Group:**
   - Use the `transform` method to find the maximum total stockholder equity within each group (country).

   ```python
   max_equity_by_country = max_group_equity.transform('max')
   ```

3. **Filter Records with Maximum Equity:**
   - Create a boolean mask by comparing the 'total_stockholder_equity' column with the 'max_equity_by_country' series.
   - Use this mask to filter the records where the total stockholder equity is maximum within each country.

   ```python
   filtered_data = data[data['total_stockholder_equity'] == max_equity_by_country]
   ```

4. **Result:**
   - The `filtered_data` DataFrame will contain records with the maximum total stockholder equity for each country.

   ```python
   filtered_data[['country','company']]
   ```

By following these steps, you can easily select records with the greatest total stockholder equity within each country in the dataset.


**Note:** We won't delve into a detailed explanation of the `groupby` and `transform` methods at this point, as they are more advanced topics. However, you can use these methods to efficiently group and perform calculations on data within specific groups in a pandas DataFrame.
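
To make the role of `transform('max')` more concrete, here is a sketch on a tiny made-up DataFrame. The values in `toy` are invented; only the column names follow the steps above.

```python
import pandas as pd

# Hypothetical data using the column names from the steps above
toy = pd.DataFrame({"country": ["US", "US", "China", "China"],
                    "company": ["A", "B", "C", "D"],
                    "total_stockholder_equity": [100, 250, 300, 150]})

# transform('max') returns a Series aligned with the original rows,
# where each row holds the maximum of its own country
max_equity_by_country = toy.groupby("country")["total_stockholder_equity"].transform("max")
print(max_equity_by_country.tolist())   # [250, 250, 300, 300]

# Keep only the rows whose equity equals their country's maximum
filtered = toy[toy["total_stockholder_equity"] == max_equity_by_country]
print(filtered["company"].tolist())     # ['B', 'C']
```

Because the transformed Series has one value per original row, it can be compared directly with the column to build the boolean mask.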

In [None]:
# Do Exercise in this cell
#
#
#