# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

**Dataset Information:**
- **Dataset Name:** Women's Clothing E-Commerce Reviews
- **File:** `Womens Clothing E-Commerce Reviews.csv`
- **Source:** This dataset contains reviews written by customers and includes features like ratings, review text, product categories, and customer information.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [1]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV
womens_clothing_df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

In [12]:
# Print out any information you need to understand your dataframe
womens_clothing_df.head()
#womens_clothing_df.columns
#womens_clothing_df.describe()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


## Missing Data

Try out different methods to locate and resolve missing data.

In [13]:
# Try to find some missing data!
#womens_clothing_df.isna().sum()
womens_clothing_df["Department Name"].value_counts(dropna = False)

Department Name
Tops        10468
Dresses      6319
Bottoms      3799
Intimate     1735
Jackets      1032
Trend         119
NaN            14
Name: count, dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here: .isna().sum() showed me which columns had missing data. Using .value_counts(dropna=False) was somewhat useful for seeing what type of null value was there, but it was tedious to run for each column.
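
One way to reduce the tedium of checking columns one at a time is to summarize the missing values for every column in a single table. The sketch below uses a small stand-in dataframe (`sample_df`, a hypothetical name with made-up gaps, not the real dataset) so that it runs on its own; the same calls should work on `womens_clothing_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for womens_clothing_df, with a few gaps
sample_df = pd.DataFrame({
    "Title": [np.nan, np.nan, "Some major design flaws", "My favorite buy!"],
    "Rating": [4, 5, 3, 5],
    "Department Name": ["Intimate", "Dresses", np.nan, "Bottoms"],
})

# Count and percentage of missing values per column, in one table
missing_summary = pd.DataFrame({
    "missing_count": sample_df.isna().sum(),
    "missing_pct": sample_df.isna().mean() * 100,
})

# Keep only the columns that have gaps, worst first
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_count", ascending=False)
print(missing_summary)

# One possible resolution: drop rows that are missing a key category
cleaned_df = sample_df.dropna(subset=["Department Name"])
```

Dropping rows is only one option. Whether to drop, fill, or keep missing values depends on the analysis, so treat the last line as an illustration rather than a recommendation.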

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [29]:
# Keep an eye out for outliers!
womens_clothing_df[["Positive Feedback Count"]].value_counts()


Positive Feedback Count
0                          11176
1                           4043
2                           2193
3                           1433
4                            922
                           ...  
98                             1
99                             1
108                            1
117                            1
122                            1
Name: count, Length: 82, dtype: int64

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here: I used .value_counts() on specific columns to see whether any values stood out from the rest, which worked fairly well. I also tried using .describe(), but that didn't show as clearly whether a max or min value was a significant outlier.
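
A more systematic complement to scanning `.value_counts()` is the interquartile range (IQR) rule, which flags values far outside the middle 50% of the data. This is a minimal sketch on a hypothetical sample of feedback counts (not the real column), and the 1.5 multiplier is a common convention, not a requirement.

```python
import pandas as pd

# Hypothetical sample of "Positive Feedback Count" values (not the real data)
feedback = pd.Series(
    [0, 0, 0, 1, 1, 2, 3, 4, 98, 122],
    name="Positive Feedback Count",
)

# Interquartile range: the spread of the middle 50% of values
q1 = feedback.quantile(0.25)
q3 = feedback.quantile(0.75)
iqr = q3 - q1

# Conventional cutoffs: 1.5 * IQR beyond each quartile
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the cutoffs are candidates for a closer look
outliers = feedback[(feedback < lower_bound) | (feedback > upper_bound)]
print(outliers)
```

On heavily skewed count data like this, the rule may flag many legitimate values, so treat the flagged rows as candidates to inspect instead of errors to delete.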

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [42]:
# Look out for unnecessary data!
#womens_clothing_df.head(30)
womens_clothing_df = womens_clothing_df.drop(columns="Unnamed: 0")
#womens_clothing_df.duplicated().sum()

Did you find any unnecessary data in your dataset? How did you handle it?

In [43]:
# Make your notes here: The first column, "Unnamed: 0", seemed unnecessary because it was the same as the index. I deleted it using .drop(). I also checked for duplicate rows, but there were none.
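
Before dropping a column that looks like a copy of the index, it can help to confirm that it really is one. The sketch below does that on a hypothetical stand-in dataframe (`sample_df`) and uses `errors="ignore"` so the cell does not raise a `KeyError` if it is run a second time after the column is already gone.

```python
import pandas as pd

# Hypothetical stand-in: a leftover index column plus real data
sample_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Clothing ID": [767, 1080, 1077, 1049],
    "Rating": [4, 5, 3, 5],
})

# Confirm the column really duplicates the index before dropping it
is_index_copy = bool(
    (sample_df["Unnamed: 0"].to_numpy() == sample_df.index.to_numpy()).all()
)

if is_index_copy:
    # errors="ignore" keeps the cell safe to re-run after the column is gone
    sample_df = sample_df.drop(columns="Unnamed: 0", errors="ignore")

# Count rows that are exact duplicates of an earlier row
duplicate_rows = sample_df.duplicated().sum()
print(sample_df.columns.tolist(), duplicate_rows)
```

Alternatively, passing `index_col=0` to `pd.read_csv` loads the first column as the index, so it never appears as a regular column.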

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [None]:
# Look out for inconsistent data!
womens_clothing_df["Rating"].value_counts(dropna=False)

Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here! I checked multiple columns for inconsistent formatting and didn't find any...was this a trick question?
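
One place worth a second look is the `Division Name` column: the `head()` preview near the top shows the value `Initmates`, which may be a misspelling. The sketch below lists the distinct values of text columns and normalizes them. It runs on a hypothetical stand-in dataframe; the `" general"` entry is invented to illustrate whitespace and capitalization issues, and treating `Initmates` as a typo for `Intimates` is an assumption that should be checked against the real data.

```python
import pandas as pd

# Hypothetical stand-in; " general" is invented for illustration
sample_df = pd.DataFrame({
    "Division Name": ["Initmates", "General", "General Petite", " general"],
    "Department Name": ["Intimate", "Dresses", "Bottoms", "Tops"],
})

# List the distinct values of each text column to spot odd spellings
for column in ["Division Name", "Department Name"]:
    print(column, sample_df[column].unique())

# Normalize stray whitespace and capitalization
sample_df["Division Name"] = sample_df["Division Name"].str.strip().str.title()

# Assumption: "Initmates" is a misspelling of "Intimates"
sample_df["Division Name"] = sample_df["Division Name"].replace(
    {"Initmates": "Intimates"}
)
print(sample_df["Division Name"].unique())
```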