# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [None]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV
df = pd.read_csv("Womens_Clothing_ECommerce_Reviews.csv")

In [None]:
# Print out any information you need to understand your dataframe
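# info() summarizes column dtypes and non-null counts
df.info()
# Only the last expression in a cell is displayed, so print the earlier one
print(df.head(100))
df.sample(25)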

## Missing Data

Try out different methods to locate and resolve missing data.
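
As a minimal sketch of what "locate and resolve" can look like, the example below uses a small, made-up dataframe (the `toy` data and its values are hypothetical, not taken from the reviews dataset). It counts missing values per column with `isna().sum()`, then shows two common ways to resolve them: filling with a placeholder and dropping incomplete rows. Which option fits depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the reviews dataset
toy = pd.DataFrame({
    "Title": ["Great dress", np.nan, "Runs small", np.nan],
    "Rating": [5, 4, np.nan, 3],
})

# Locate: count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Resolve, option 1: fill missing text with a placeholder
filled = toy.assign(Title=toy["Title"].fillna("Untitled"))

# Resolve, option 2: drop the rows that are missing a rating
dropped = toy.dropna(subset=["Rating"])
```

Filling keeps every row, which suits free-text columns such as a title; dropping is usually reserved for columns the analysis cannot do without.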

In [None]:
# Try to find some missing data!
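# value_counts(dropna=False) tallies every title and counts NaN as its own entry
print(df['Title'].value_counts(dropna=False).head())
# isna().sum() gives the number of missing titles directly
missing_values_count = df['Title'].isna().sum()
print(f"Number of missing values in 'Title': {missing_values_count}")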

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# Using value_counts with dropna=False, I was able to identify 3810 reviews that did not have a title.

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.
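
One common way to detect outliers is the interquartile range (IQR) rule of thumb. The sketch below applies it to a small, made-up series of counts (the values are hypothetical); the 1.5 multiplier is a convention, not a requirement, and a fixed threshold may suit your data better.

```python
import pandas as pd

# Hypothetical toy counts: most are small, one is far larger
counts = pd.Series([0, 1, 2, 2, 3, 4, 122], name="Positive Feedback Count")

# The IQR is the spread of the middle 50% of the values
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR above the third quartile
upper_fence = q3 + 1.5 * iqr
outliers = counts[counts > upper_fence]
print(outliers)
```

Because the fence is derived from the data itself, this approach adapts to each column instead of relying on a hand-picked cutoff.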

In [None]:
# Keep an eye out for outliers!
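print(df['Positive Feedback Count'].value_counts())
# np.where returns a tuple; [0] extracts the array of matching row positions
outliers = np.where(df['Positive Feedback Count'] > 100)[0]
print(outliers)

# Series has no .range() method; describe() reports min, max, mean, and quartiles
df['Positive Feedback Count'].describe()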

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
# The value_counts function helped me see that a few clothing items have a lot of positive feedback, with the max being 122. On average, clothing items receive 2-3 positive feedback responses.

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.
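
Before dropping a column, it can help to confirm that it really is redundant. The sketch below, on a made-up dataframe (the values are hypothetical), checks whether a column simply mirrors the row index and drops it only if so.

```python
import pandas as pd

# Hypothetical toy data: "Unnamed: 0" just repeats the row index
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Rating": [4, 5, 3],
})

# Check whether the column only mirrors the index
mirrors_index = (toy["Unnamed: 0"] == toy.index).all()

# drop() returns a new dataframe, so assign the result back
if mirrors_index:
    toy = toy.drop(columns=["Unnamed: 0"])

print(toy.columns.tolist())
```

Note that `drop` does not modify the dataframe in place by default, so the result has to be assigned back for the change to stick.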

In [None]:
# Look out for unnecessary data!
# drop() returns a new dataframe, so assign it back to keep the change
df = df.drop(columns=['Unnamed: 0'])

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
# Using the .sample function, I noticed there was a duplicate index/id column, so I dropped it using the drop function.

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.
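
A quick way to spot inconsistent formatting is to compare the number of unique values before and after normalizing. The sketch below uses a made-up series (the values are hypothetical) in which the same category appears with different casing and stray whitespace.

```python
import pandas as pd

# Hypothetical toy values with mixed case and stray whitespace
names = pd.Series(["Dresses", " dresses", "DRESSES ", "Knits"])

# Count distinct values before normalizing
before = names.nunique()

# Strip surrounding whitespace and lowercase every value
cleaned = names.str.strip().str.lower()

# Count distinct values after normalizing
after = cleaned.nunique()
print(cleaned.value_counts())
```

If the unique count shrinks after normalizing, the column held variants of the same value that only differed in formatting.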

In [None]:
# Look out for inconsistent data!
df["Title"] = df["Title"].str.strip().str.upper()
df["Title"] = df["Title"].fillna("UNTITLED")
df.head(25)

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# I stripped any potential whitespace and converted the titles to uppercase. In addition, I replaced the null values with "UNTITLED".