# Applied Data Lab

# Assignment 04: Data Cleaning

## Setting Up the Address
The cell below sets a `PATH` variable to the current working directory (the folder where the notebook is open), so the dataset file can be loaded easily from this location.

In [1]:
import pandas as pd

In [None]:
# Run this cell
import os
PATH = os.getcwd() + '/'
PATH

**ONLY FOR GOOGLE COLAB USERS**

For those who are using **Google Colab**, uncomment and run the cell below.

**Note**: You have to replace the value of the variable `YOUR_PATH_TO_DATASET_DIRECTORY` with the path to the folder in your Google Drive where the dataset is placed.



In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')
# YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab/phase_2"
# PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
# PATH

The cell below imports the `laptops.csv` file into the `data` variable and repeats some of the preprocessing from the previous assignment for you.


In [None]:
# Run this cell

# Read the CSV file and normalize the column names: strip, lowercase, drop parentheses,
# shorten "operating system" to "OS", and replace spaces with underscores
data = pd.read_csv(PATH+'laptops.csv', encoding="latin-1").rename(
    columns=lambda x: x.strip().lower().replace("(", "").replace(")", "")
    .replace("operating system", "OS").replace(" ", "_")
)

# Convert the screen size and RAM columns to numerics
data["screen_size"] = data["screen_size"].str.rstrip('"').astype(float)
data["ram"] = data["ram"].str.rstrip("GB").astype(int)

# Add the cpu_manufacturer column
data.insert(6,"cpu_manufacturer", data["cpu"].str.split(n=1, expand=True).iloc[:, 0])

# Display the first two rows of the DataFrame
data.head(2)

## Exercise 1: Checking Null Values in Each Column

**Objective:** Check for null values in a DataFrame using the `isnull()` method.

**Instructions:**

1. Check for null values in each column of the DataFrame using the `isnull()` method.
2. Apply the `sum()` method after calling `isnull()` to return the total count of null values in each column.
3. Print the total count of null values in each column.
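
The pattern in these steps can be tried out on a small toy DataFrame first. The sketch below uses made-up data (the `toy` frame and its column names are assumptions for illustration, not the laptops dataset). By default `sum()` adds up along each column; passing `axis=1` gives the count per row instead.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (hypothetical, not the laptops dataset)
toy = pd.DataFrame({
    "brand": ["A", "B", None],
    "ram": [8, np.nan, np.nan],
    "os": ["Linux", "Windows", "Linux"],
})

# isnull() yields a True/False frame; sum() counts the True values per column
null_counts = toy.isnull().sum()

# For comparison: axis=1 counts the null values in each row
null_per_row = toy.isnull().sum(axis=1)

print(null_counts)
print(null_per_row)
```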

In [None]:
# Do Exercise in this cell
#
#
#

We now know that there are null entries in the `OS_version` column after checking the sum of the null values.

## Exercise 2: Print/Display Unique Set of Values

**Objective:** Print the unique set of values of `OS` and `OS_version` columns of the DataFrame.

**Instructions:**

1. Check for the unique values of `OS` and `OS_version` columns using the `unique()` method.
2. Print the unique values of `OS` and `OS_version` columns using the `print()` function.

OUTPUT:
```
Unique OS values: ['macOS' 'No OS' 'Windows' 'Mac OS' 'Linux' 'Android' 'Chrome OS']
Unique OS version values: [nan '10' 'X' '10 S' '7']
```
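
As a rough illustration of how `unique()` behaves, here is a minimal sketch on toy data (the `toy` frame and its lowercase column names are assumptions, not the laptops dataset). Each distinct value is returned once, in order of first appearance, and a missing value is reported as a single `nan` entry.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (hypothetical, not the laptops dataset)
toy = pd.DataFrame({
    "os": ["Windows", "Linux", "Windows", "macOS"],
    "os_version": ["10", np.nan, "7", np.nan],
})

# unique() returns each distinct value once, including a missing value
unique_os = toy["os"].unique()
unique_versions = toy["os_version"].unique()

print("Unique os values:", unique_os)
print("Unique os_version values:", unique_versions)
```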

In [None]:
# Do Exercise in this cell
#
#
#

Checking for unique values in the `OS_version` column can also help us to identify rows with null values in that column. We can then create a filter for only those rows and pass it to the `OS` column to see which operating systems are missing OS version information.
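
The filtering idea described above can be sketched on toy data. The frame, its column names, and the variable names below are assumptions for illustration only; the same pattern applies to any pair of columns.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (hypothetical, not the laptops dataset)
toy = pd.DataFrame({
    "os": ["Linux", "Windows", "Linux", "No OS", "Windows"],
    "os_version": [np.nan, "10", np.nan, np.nan, "7"],
})

# Boolean mask: True for rows where the version is missing
missing_version = toy["os_version"].isnull()

# Keep only the masked rows of the "os" column, then count each value
counts = toy.loc[missing_version, "os"].value_counts()
print(counts)
```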

## Exercise 3: Count Unique Values

**Objective:** Count the unique values of `OS` for rows where the `OS_version` is null.

**Instructions:**

1. Create a mask for rows where the `OS_version` is null using the `isnull()` method.
2. Filter the `OS` column using the mask to only include rows where the `OS_version` is null.
3. Count how often each unique value occurs in the filtered `OS` column using the `value_counts()` method.
4. Print the resulting counts.

OUTPUT:

```
Count of unique OS values for rows with null OS_version:
No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: OS, dtype: int64
```


In [None]:
# Do Exercise in this cell
#
#
#

## Exercise 4: Cleaning Values of Specific Columns

**Heading:** Cleaning Values of Specific Columns (OS_version)

**Objective:** Assign the value `'unknown'` to any null values in the `OS_version` column and print the value counts of `OS_version` column.

**Instructions:**

1. Create a mask for rows where the `OS_version` column is null using the `isnull()` method, or reuse the mask from Exercise 3.

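2. Assign the value `'unknown'` to any null values in the `OS_version` column:

**HINT:** `data.loc[mask, column] = 'unknown'` (chained indexing such as `data[mask][column] = 'unknown'` assigns to a temporary copy and leaves `data` unchanged)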

3. Print the value counts of the `OS_version` column:

**Output:**

```
Count of unique OS_version:
10         1072
unknown     170
7            45
X             8
10 S          8
Name: OS_version, dtype: int64
```
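
A minimal sketch of mask-based assignment on toy data follows (the frame and column names are assumptions for illustration, not the laptops dataset). It uses `.loc` with a row mask and a column label in a single indexing step, because chained indexing such as `toy[mask]["os_version"] = ...` writes to a temporary copy and generally leaves the original frame untouched.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (hypothetical, not the laptops dataset)
toy = pd.DataFrame({
    "os": ["Linux", "Windows", "No OS"],
    "os_version": [np.nan, "10", np.nan],
})

# True where the version is missing
mask = toy["os_version"].isnull()

# Single-step .loc assignment modifies the frame itself
toy.loc[mask, "os_version"] = "unknown"

version_counts = toy["os_version"].value_counts()
print(version_counts)
```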

In [None]:
# Do Exercise in this cell
#
#
#

### Exercise 4 (continued): Changing Data Types

**Objective:** Convert the `price_euros`, `screen_size`, and `ram` columns of the dataset to appropriate data types.

**Instructions:**


1. To change the data types of specific columns:
   - Use the `.astype(float)` method on the `price_euros` column to convert it to the float data type.
   - Use the `.astype(float)` method on the `screen_size` column to convert it to the float data type.
   - Use the `.astype(int)` method on the `ram` column to convert it to the integer data type.
   - Hint: `data["column"] = data["column"].astype(int)`.

2. Ensure that you overwrite the existing columns in the DataFrame with the new data types.

3. Display the DataFrame, or inspect `data.dtypes`, to check that the data type changes have been applied correctly.
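
The conversion steps can be sketched on toy data as shown below. Everything here is an assumption for illustration: the frame, the column names, and in particular the comma used as a decimal separator in the toy prices. `astype(float)` cannot parse a string like `"1339,69"`, so if a real column happens to be stored that way, the comma has to be replaced by a dot first.

```python
import pandas as pd

# Toy data for illustration only (hypothetical, not the laptops dataset)
toy = pd.DataFrame({
    "price": ["1339,69", "898,94"],   # assumed comma decimal separator
    "screen": ["13.3", "15.6"],
    "ram": ["8", "16"],
})

# Replace the decimal comma with a dot, then convert to float
toy["price"] = toy["price"].str.replace(",", ".", regex=False).astype(float)

# Overwrite each column with its converted version
toy["screen"] = toy["screen"].astype(float)
toy["ram"] = toy["ram"].astype(int)

# dtypes shows the data type of every column
print(toy.dtypes)
```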

In [None]:
# Do Exercise in this cell
#
#
#

## Exercise 5: Saving the Clean Dataset in CSV Format

`data.to_csv(location_path, index=False)` saves the DataFrame `data` to a CSV file at `location_path` without writing the row index. With the default `index=True`, pandas writes the index as an extra unnamed first column, which is rarely wanted in a CSV file. Spreadsheet programs such as Excel also display their own row numbers, so a saved index would only show up as a redundant second numbering.
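
To see the effect of `index=False`, here is a small sketch that writes a toy DataFrame twice and reads both files back. The frame, the file names, and the temporary directory are assumptions for illustration; they are unrelated to `PATH` and the laptops dataset.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only (hypothetical, not the laptops dataset)
toy = pd.DataFrame({"brand": ["A", "B"], "ram": [8, 16]})

# Write into a temporary directory so no real files are touched
out_dir = tempfile.mkdtemp()
no_index_path = os.path.join(out_dir, "toy_no_index.csv")
with_index_path = os.path.join(out_dir, "toy_with_index.csv")

toy.to_csv(no_index_path, index=False)  # index is not written
toy.to_csv(with_index_path)             # default index=True writes it

reloaded = pd.read_csv(no_index_path)
reloaded_with_index = pd.read_csv(with_index_path)

# The second file comes back with an extra "Unnamed: 0" column
print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```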

In [None]:
# Run this cell

# Use PATH+'clean_laptop.csv' in place of location_path
PATH+'clean_laptop.csv'

In [None]:
# Do Exercise in this cell
#
#
#