# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [None]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV

df = pd.read_csv(r"C:\Users\pered\OneDrive\Documents\Launch code\Data Analysis\github\github repo\data-analysis-projects\cleaning-data-with-pandas\exercises\Womens Clothing E-Commerce Reviews.csv")
df = df.rename(columns={'Unnamed: 0': 'Index'})

df.set_index("Index", inplace=True)

df

In [None]:
# Print out any information you need to understand your dataframe

This code imports the required libraries, loads the dataset into a DataFrame, renames the unnamed index column to “Index,” sets it as the DataFrame index, and finally displays the DataFrame to help understand its structure.
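
One way to fill in the inspection cell above is sketched below. It uses a tiny stand-in DataFrame so that it runs on its own; the `sample` name and its values are made up for illustration, and in the notebook the same calls would be made on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data; values are invented for illustration.
sample = pd.DataFrame({
    "Age": [33, 34, 60, 50],
    "Rating": [4, 5, 3, 5],
    "Division Name": ["General", "General", None, "General Petite"],
})

# Column names, dtypes, and non-null counts in one summary
sample.info()

# Summary statistics for the numeric columns only
summary = sample.describe()
print(summary)

# First few rows, to see what the values actually look like
print(sample.head())
```

`info()` is usually the quickest first look, because the non-null counts already hint at which columns have gaps.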

## Missing Data

Try out different methods to locate and resolve missing data.

In [None]:
# Try to find some missing data!
# Count the missing values in each column
print(df.isna().sum())

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

There are missing values present in the Title, Review Text, Division Name, Department Name, and Class Name columns.
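
Beyond raw counts, it can help to look at the share of missing values per column and at the affected rows themselves. The following is a minimal sketch on a small made-up DataFrame (`sample` and its values are hypothetical), standing in for `df`:

```python
import pandas as pd

# Hypothetical stand-in data with a few gaps
sample = pd.DataFrame({
    "Title": [None, "Love it", None, "Too small"],
    "Rating": [4, 5, 3, 2],
    "Class Name": ["Dresses", None, "Knits", "Pants"],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = sample.isna().mean()
print(missing_share)

# Rows where at least one column is missing
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(rows_with_gaps)
```

A column that is mostly empty may be a candidate for dropping, while a column with only a handful of gaps is often easier to fill.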

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [None]:
# Keep an eye out for outliers!

print("Old Shape: ", df.shape)

# Compute IQR
Q1 = np.percentile(df['Age'], 25, method='midpoint')
Q3 = np.percentile(df['Age'], 75, method='midpoint')
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Find outliers
upper_array = np.where(df['Age'] >= upper)[0]
lower_array = np.where(df['Age'] <= lower)[0]

print("Upper outliers:", len(upper_array))
print("Lower outliers:", len(lower_array))

# Remove outliers (np.where returns row positions, so map them to index labels before dropping)
outliers = np.concatenate([upper_array, lower_array])
df_clean = df.drop(df.index[outliers])

print("New Shape:", df_clean.shape)


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:

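To identify outliers, I used the Interquartile Range (IQR) method, which flags any value lying 1.5 × IQR or more below the first quartile or above the third quartile. It is effective because the quartiles themselves are barely affected by extreme values, so the cutoffs are not distorted by the outliers being detected.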
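
The IQR rule can also be worked through on a handful of numbers to see where the cutoffs land. The sketch below uses made-up ages and a hypothetical helper, `iqr_bounds`; it relies on `Series.quantile` with `interpolation="midpoint"` to mirror the `method='midpoint'` choice in the cell above.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) cutoffs k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25, interpolation="midpoint")
    q3 = values.quantile(0.75, interpolation="midpoint")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages; 99 is deliberately extreme
ages = pd.Series([25, 30, 35, 40, 45, 99])

lower, upper = iqr_bounds(ages)
print(lower, upper)

# Boolean mask of values at or beyond either cutoff
outlier_mask = (ages <= lower) | (ages >= upper)
print(ages[outlier_mask])
```

Here Q1 is 32.5 and Q3 is 42.5, so the IQR is 10 and the cutoffs are 17.5 and 57.5; only 99 falls outside them.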

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [None]:
# Look out for unnecessary data!
In this dataset, the Review Text and Title columns are not necessary for the analysis.

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
In this dataset, the Review Text and Title columns are not necessary for the analysis, so they can be removed using the df.drop() function.
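
A minimal sketch of that removal is shown here on a small made-up DataFrame (`sample` and `trimmed` are hypothetical names) rather than on `df` itself. The `errors="ignore"` argument is an optional choice that lets the cell be re-run without failing once the columns are already gone.

```python
import pandas as pd

# Hypothetical stand-in with the two free-text columns
sample = pd.DataFrame({
    "Title": ["Love it", None],
    "Review Text": ["Fits well.", "Runs small."],
    "Rating": [5, 2],
})

# Drop the free-text columns; drop() returns a new DataFrame by default
trimmed = sample.drop(columns=["Title", "Review Text"], errors="ignore")
print(trimmed.columns.tolist())
```

Because `drop()` returns a copy, the result has to be assigned (for example back to `df`) for the removal to stick.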

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.
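
As a sketch of what re-formatting a column can look like, the example below normalizes whitespace and letter case in a text column. The labels are made up for illustration and are not taken from the reviews dataset, which may or may not contain this kind of inconsistency.

```python
import pandas as pd

# Made-up category labels with inconsistent case and stray whitespace
classes = pd.Series(["Dresses", " dresses", "DRESSES ", "Knits"])

# Before cleaning, each spelling counts as a separate category
print(classes.nunique())

# Trim whitespace and normalize case so equal labels compare equal
cleaned = classes.str.strip().str.title()
print(cleaned.value_counts())
```

Running `value_counts()` on a categorical column before and after is a simple way to spot near-duplicate labels.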

In [None]:
# Look out for inconsistent data!
df.isnull().sum()
I used the isnull() function to find the null values.

Output:
Clothing ID                   0
Age                           0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64
There are missing values in the Division Name, Department Name, and Class Name columns.

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
To clean the dataset, I handled the missing values by replacing them with "Unknown" using the fillna() function:
df['Division Name'] = df['Division Name'].fillna('Unknown')
df['Department Name'] = df['Department Name'].fillna('Unknown')
df['Class Name'] = df['Class Name'].fillna('Unknown')


df.isnull().sum()

Output:
Clothing ID                0
Age                        0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64