# Applied Data Lab

# Assignment 03: Data Cleaning

## String Operations

String operations manipulate text data, for example by trimming whitespace, changing case, or replacing substrings.

### `strip()`

One common operation is removing whitespace at the beginning or end of a string, which can be done using the `.strip()` method. With no argument, it trims leading and trailing whitespace (spaces, tabs, and newlines) and returns a new string; the original string is unchanged.

**Example:**

```python
# Remove leading and trailing spaces from a string
text = "  hello  "
result = text.strip()
print(result)  # Output: "hello"
```

If an argument is passed, it is treated as a set of characters: any of them is removed from the beginning and end of the string until a character outside the set is reached.

**Example:**

```python
# Remove leading and trailing question marks from a string
text = "???text with spaces???"
result = text.strip("?")
print(result)  # Output: "text with spaces"
```
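
The argument to `.strip()` is a set of characters, not a literal prefix or suffix. The short sketch below uses made-up sample strings to illustrate both the default behaviour and the character-set behaviour.

```python
# With no argument, strip() removes all leading/trailing whitespace,
# including tabs and newlines, not only spaces
messy = "\t  hello \n"
print(repr(messy.strip()))  # 'hello'

# With an argument, every character in it is stripped from both ends,
# in any order, until a character outside the set is reached
tagged = "?!?text with spaces!??"
cleaned = tagged.strip("?!")
print(cleaned)  # text with spaces
```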


In [148]:
# Run Example
#
#

### `rstrip()`

The `rstrip()` method removes trailing (right-hand) characters from a string. With no argument it removes trailing whitespace. If you pass a string, each of its characters is stripped from the end, working inward from the right, until a character outside that set is reached.

**Example:**

```python
# Remove trailing question marks from the end of a string
text = "???text with spaces???"
result = text.rstrip("?")
print(result)  # Output: "???text with spaces"
```

### `lstrip()`

The `lstrip()` method removes leading (left-hand) characters from a string. With no argument it removes leading whitespace. If you pass a string, each of its characters is stripped from the beginning, working inward from the left, until a character outside that set is reached.

**Example:**

```python
# Remove leading question marks from the beginning of a string
text = "???text with spaces???"
result = text.lstrip("?")
print(result)  # Output: "text with spaces???"
```
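
Both methods can also be called without an argument, in which case they trim whitespace on one side only. A minimal sketch with a made-up sample line:

```python
# Hypothetical line of text with whitespace on both sides
line = "   price: 499   \n"

right_trimmed = line.rstrip()  # trailing spaces and the newline are removed
left_trimmed = line.lstrip()   # only the leading spaces are removed

print(repr(right_trimmed))  # '   price: 499'
print(repr(left_trimmed))   # 'price: 499   \n'
```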

In [None]:
# Run These Examples
#
#

### `lower()`

You can convert text to lowercase using the `.lower()` method. This operation is especially useful for making text data consistent and easier to work with. It transforms all characters in a string to their lowercase form.

**Example:**

```python
# Convert text to lowercase
text = "Hello World"
lowercase_text = text.lower()
print(lowercase_text)  # Output: "hello world"
```
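
To see why lowercasing helps with consistency, consider labels that differ only in case or stray spaces. The values below are invented for illustration.

```python
# Hypothetical labels that refer to the same thing but are written differently
labels = ["Windows", "WINDOWS", "windows ", "macOS"]

# Trim whitespace and lowercase each label so equal values compare equal
normalized = [label.strip().lower() for label in labels]

print(normalized)            # ['windows', 'windows', 'windows', 'macos']
print(len(set(normalized)))  # 2
```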

In [None]:
# Run Example
#
#

### `replace()`

The `.replace()` string method replaces every occurrence of a specified substring with another substring. pandas exposes the same operation for whole columns through `.str.replace()`. This can be helpful for cleaning and modifying text data.

**Example:**

```python
# Replace a substring in a string
text = "I like cats, but some people like dogs."
modified_text = text.replace("cats", "hamsters")
print(modified_text)  # Output: "I like hamsters, but some people like dogs."
```
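
A few more behaviours of `str.replace()` are worth knowing: it replaces every occurrence by default, it accepts an optional count, and calls can be chained. A short sketch reusing the sentence above:

```python
text = "I like cats, but some people like dogs."

# By default, every occurrence is replaced
all_replaced = text.replace("like", "love")
print(all_replaced)  # I love cats, but some people love dogs.

# An optional third argument limits how many occurrences are replaced
first_only = text.replace("like", "love", 1)
print(first_only)  # I love cats, but some people like dogs.

# Calls can be chained; replacing with "" deletes the substring
no_punct = text.replace(",", "").replace(".", "")
print(no_punct)  # I like cats but some people like dogs
```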

In [None]:
# Run Example
#
#

### Exercise 1: Cleaning Text

**Objective:** Clean and preprocess the given text.

```python
text = "    A warden, a powerful but evadable $hostile mob $, summoned by sculk shriekers in the deep dark biome, attacks by swinging its arms downward. It deals the highest melee damage among all mobs and can also release a sonic boom attack. This attack homes in on the target, pierces obstacles, and only the Resistance effect can reduce its damage. Wardens are blind and rely on vibrations, smell, and touch to detect players and mobs for attack. They can be evaded via sneaking, diversions, and wool.????       "
```

**Instructions:**

1. Apply the `.strip()` method to remove leading and trailing spaces from the text.

2. Apply the `.lower()` method to convert the text to lowercase.

3. Use the `.replace()` method to remove any dollar signs ('$') by replacing them with an empty string ('').

4. Use the `.replace()` method again to remove any commas (',') and periods ('.') by replacing them with spaces.

5. Apply the `.split()` method to convert the cleaned text into a list of words.

6. Convert the list of words into a set to remove duplicate words.

**HINT:** `set(list_of_words)`
A set effectively eliminates duplicate values, retaining only unique elements from the list.

7. Convert the set back into a single string using `" ".join(set_of_words)`, then print it. Because sets are unordered, the word order will differ from the original text.
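
The split, set, and join steps can be sketched on a short made-up sentence (not the exercise text). `sorted()` is used here only to make the printed order repeatable; the exercise does not require it.

```python
# Hypothetical sentence with repeated words
sentence = "the cat saw the dog and the dog saw the cat"

words = sentence.split()   # list of words, split on whitespace
unique_words = set(words)  # duplicates are dropped, order is not preserved

# Sorting is optional; it just gives a stable order for display
joined = " ".join(sorted(unique_words))
print(joined)  # and cat dog saw the
```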


In [28]:
# Do Exercise in this cell
#
#
#
text = "    A warden, a powerful but evadable $hostile mob $, summoned by sculk shriekers in the deep dark biome, attacks by swinging its arms downward. It deals the highest melee damage among all mobs and can also release a sonic boom attack. This attack homes in on the target, pierces obstacles, and only the Resistance effect can reduce its damage. Wardens are blind and rely on vibrations, smell, and touch to detect players and mobs for attack. They can be evaded via sneaking, diversions, and wool.????       "

## Setting Up the Address
In this cell, the `PATH` variable is set to the current working directory (normally the directory where the notebook is open), so the dataset file can be loaded from that location.

In [42]:
import pandas as pd

In [None]:
# Run this cell
import os
PATH = os.getcwd() + '/'
PATH
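
As an alternative to concatenating strings with `'/'`, the path can be built with `os.path.join`, which picks the right separator for the operating system. This is only a sketch; the name `csv_path` is introduced here for illustration and is not used elsewhere in the notebook.

```python
import os

base_dir = os.getcwd()  # current working directory

# Hypothetical variable: full path to the dataset file in that directory
csv_path = os.path.join(base_dir, "laptops.csv")

print(csv_path)
print(os.path.exists(csv_path))  # True only if the file is actually there
```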

**ONLY FOR GOOGLE COLAB USERS**

For those who are using **Google Colab**, uncomment and run the cell below.

**Note**: You have to replace the value of the variable `YOUR_PATH_TO_DATASET_DIRECTORY` with the path to the folder in your Google Drive where the dataset is stored.



In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')
# YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab/phase_2"
# PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
# PATH

In [35]:
from google.colab import drive
drive.mount('/content/drive/')
YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab_Assignments/phase_2"
PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
PATH

Mounted at /content/drive/


'/content/drive/MyDrive/work/Applied_Data_Lab_Assignments/phase_2/'

### Exercise 2: Read Data

Load `laptops.csv` into a variable named `data`.


In [None]:
# Do Exercise in this cell
#
#
#

In [145]:
#HINT
data = pd.read_csv(PATH+'laptops.csv', encoding="latin-1")

Display the first rows of the data using the `.head()` method.
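
As a minimal illustration of `.head()`, here is a small made-up DataFrame (not the laptops data). By default it returns the first five rows, and an integer argument changes that number.

```python
import pandas as pd

# Hypothetical data, used only to show how head() behaves
sample = pd.DataFrame({
    "model": ["A", "B", "C", "D", "E", "F"],
    "ram": [4, 8, 8, 16, 16, 32],
})

print(sample.head())   # first five rows
print(sample.head(2))  # first two rows
```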

### Exercise: Clean Column Names

**Objective:** Clean and simplify column names in a DataFrame.

**Instructions:**

1. Get the column names using `data.columns`.
   - *Hint*: To retrieve the column names, use `col = data.columns`.

2. Clean the column names to make them more concise and meaningful:
   - Remove leading and trailing spaces from the names.
   - Convert the names to lowercase.
   - Replace spaces with underscores.
   - Remove any parentheses from the names.
   - *Hint*: You can use string manipulation methods like `.strip()`, `.lower()`, `.replace()`, and `.str` to apply these changes to the column names. `col = col.str.method()`

3. Replace a specific column name to be more informative:
   - Rename the column named 'operating_system' to 'os'.
   - *Hint*: Use the `.replace()` method to rename the column.

4. Print the cleaned and modified column names.
   - *Hint*: Use the `print()` function to display the cleaned column names.

5. Convert the modified column names to a list.
   - *Hint*: Use the `list()` function to convert the column names from an index to a list.

6. Modify a specific column name to include '_kg' as a suffix.
   - *Hint*: `col[-2] = col[-2] + "_kg"`

7. Set the cleaned column names as the new column names in the DataFrame using `data.columns = col`.
   - *Hint*: Assign the cleaned column names (variable `col`) back to the DataFrame's `.columns` attribute.
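
The same sequence of steps can be sketched on a toy DataFrame. The column names below are invented to mimic the kinds of problems described above; they are not the actual names in `laptops.csv`. Passing `regex=False` is a choice made here so that `(` and `)` are treated as plain characters.

```python
import pandas as pd

# Hypothetical frame whose column names mimic common problems
toy = pd.DataFrame(columns=[" Model Name", "Operating System", "Weight", "Price (Euros)"])

col = toy.columns
col = col.str.strip().str.lower()             # trim and lowercase
col = col.str.replace(" ", "_", regex=False)  # spaces -> underscores
col = col.str.replace("(", "", regex=False).str.replace(")", "", regex=False)
col = col.str.replace("operating_system", "os", regex=False)

col = list(col)              # an Index is immutable, a list is not
col[-2] = col[-2] + "_kg"    # add the unit to the second-to-last name
toy.columns = col

print(list(toy.columns))  # ['model_name', 'os', 'weight_kg', 'price_euros']
```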


In [None]:
# Do Exercise in this cell
#
#
#

First, check the data types of all attributes using the `.info()` method.

Columns such as 'screen_size', 'ram', 'weight_kg', and 'price_euros' currently store numeric values as strings with attached non-numeric characters. To perform data type conversion, we should first remove these non-numeric characters from the values.
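
To illustrate why the cleanup has to come first, the sketch below tries to convert a small made-up Series of screen sizes before and after removing the trailing quote character.

```python
import pandas as pd

# Hypothetical values in the same style as the 'screen_size' column
raw = pd.Series(['15.6"', '13.3"'])

try:
    raw.astype(float)  # fails: the trailing quote is not part of a number
except ValueError as err:
    print("Conversion failed:", err)

# After stripping the non-numeric character, the conversion succeeds
converted = raw.str.rstrip('"').astype(float)
print(converted.dtype)  # float64
```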



### Exercise 3: Cleaning Values of Specific Columns

**Objective:** Clean the values in the 'price_euros', 'screen_size', 'weight_kg', and 'ram' columns.

**Instructions:**

In this exercise, you will perform data cleaning on specific columns within the dataset using Pandas. Follow these steps:

1. **Clean 'price_euros' Column:**
   - Use the `str.replace()` method to replace all commas (`,`) with periods (`.`) in the 'price_euros' column. This will ensure that prices are represented with decimal points instead of commas. Hint: `data["column"] = data["column"].str.replace(.....)`.

2. **Clean 'screen_size' Column:**
   - Use the `str.rstrip()` method to remove the trailing double quote (`"`) from values in the 'screen_size' column. This leaves only the numeric part of each screen size, still stored as a string until the data type is changed. Hint: `rstrip('"')`.

3. **Clean 'weight_kg' Column:**
   - Use the `str.replace()` method to remove the "kg" suffix from the values in the 'weight_kg' column. This will leave only the numeric weight values.

4. **Clean 'ram' Column:**
   - Use the `str.replace()` method to remove the "GB" suffix from the values in the 'ram' column. This will leave only the numeric values.

In [None]:
# Do Exercise in this cell
#
#
#

In [140]:
data["price_euros"] = data["price_euros"].str.replace(",",".")
data["screen_size"] = data["screen_size"].str.rstrip('"')
data["weight_kg"] = data["weight_kg"].str.replace('kg',"")
data["ram"] = data["ram"].str.replace('GB',"")
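
Depending on the pandas version, `str.replace` may interpret the pattern as a regular expression by default, which matters for characters such as `.`. Passing `regex=False` makes the literal intent explicit. The sketch below uses invented price strings with a thousands separator, which is an assumption for illustration and not a claim about the laptops data.

```python
import pandas as pd

# Hypothetical prices written with "." for thousands and "," for decimals
prices = pd.Series(["1.339,69", "575,00"])

# regex=False treats "." as a literal dot rather than "any character"
no_thousands = prices.str.replace(".", "", regex=False)
as_decimal = no_thousands.str.replace(",", ".", regex=False)

print(as_decimal.tolist())  # ['1339.69', '575.00']
```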

### Exercise 4: Changing Data Types

**Objective:** Convert the `price_euros`, `screen_size`, and `ram` columns to appropriate data types.

**Instructions:**


1. To change the data types of specific columns:
   - Use the `.astype(float)` method on the "price_euros" column to convert it to the float data type.
   - Use the `.astype(float)` method on the "screen_size" column to convert it to the float data type.
   - Use the `.astype(int)` method on the "ram" column to convert it to the integer data type.
   - Hint: `data["column"] = data["column"].astype(int)`.

2. Ensure that you overwrite the existing columns in the DataFrame with the new data types.

3. Check that the data type changes have been applied correctly, for example with `data.info()` or `data.dtypes`.

In [141]:
data["price_euros"] = data["price_euros"].astype(float)
data["screen_size"] = data["screen_size"].astype(float)
data["ram"] = data["ram"].astype(int)

**Handling Abnormal Data in the "weight_kg" Column**

In the dataset, the "weight_kg" column presented an issue with one row containing the letter 's'. This anomaly impeded the straightforward conversion of the entire column to the float data type, leading to errors. However, we've addressed this issue with the following steps:

1. **Identification:** We pinpointed the row(s) containing the 's' value using the code snippet: `data[data["weight_kg"].apply(lambda x: x.find("s") != -1)]`. This code locates the row that necessitates correction.

2. **Resolution:** To correct this anomaly, we used the `str.replace` method, replacing the character 's' with an empty string `''`. This correction ensures that the "weight_kg" column contains only values that can be converted to a numeric type.
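
An alternative way to find such values, not the approach used in this notebook, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`. The sketch below runs on a small made-up Series rather than the real column.

```python
import pandas as pd

# Hypothetical weights, one of which carries a stray 's'
weights = pd.Series(["1.37", "2.2", "4s", "1.83"], name="weight_kg")

# Values that cannot be parsed as numbers become NaN
parsed = pd.to_numeric(weights, errors="coerce")
bad_rows = weights[parsed.isna()]
print(bad_rows)  # shows the entry "4s" at index 2

# Remove the stray character, then convert the whole Series
fixed = weights.str.replace("s", "", regex=False).astype(float)
print(fixed.tolist())  # [1.37, 2.2, 4.0, 1.83]
```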

In [142]:
# Run this cell
display(data[data["weight_kg"].apply(lambda x: x.find("s") != -1)])

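|      | manufacturer | model_name | category | screen_size | screen | cpu | ram | storage | gpu | os | os_version | weight_kg | price_euros |
|------|--------------|------------|----------|-------------|--------|-----|-----|---------|-----|----|------------|-----------|-------------|
| 1061 | Asus | Rog G752VL-GC088D | Gaming | 17.3 | IPS Panel Full HD 1920x1080 | Intel Core i7 6700HQ 2.6GHz | 16 | 1TB HDD | Nvidia GeForce GTX 965M | No OS | | 4s | 998.0 |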


In [143]:
# Run this cell
# Remove the stray 's' so that the value '4s' becomes '4'
data["weight_kg"] = data["weight_kg"].str.replace('s', '')
data["weight_kg"] = data["weight_kg"].astype(float)

In [144]:
# Run this cell
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  1303 non-null   object 
 1   model_name    1303 non-null   object 
 2   category      1303 non-null   object 
 3   screen_size   1303 non-null   float64
 4   screen        1303 non-null   object 
 5   cpu           1303 non-null   object 
 6   ram           1303 non-null   int64  
 7   storage       1303 non-null   object 
 8   gpu           1303 non-null   object 
 9   os            1303 non-null   object 
 10  os_version    1133 non-null   object 
 11  weight_kg     1303 non-null   float64
 12  price_euros   1303 non-null   float64
dtypes: float64(3), int64(1), object(9)
memory usage: 132.5+ KB
