
This guide explains the basics of what data cleaning is, then turns into a step-by-step how-to for data cleaning with Python.

1. What is Data Cleaning?
Data cleaning is the process of correcting or removing corrupt, incorrect, or unnecessary data from a data set before data analysis.

With the definition out of the way, let’s get into the nitty-gritty of cleaning our data with Pandas and NumPy.

2. Data Cleaning With Python
Using Pandas and NumPy, we are going to walk you through the basic data cleaning tasks listed below:

- Importing Libraries
- Input Customer Feedback Dataset
- Locate Missing Data
- Check for Duplicates
- Detect Outliers
- Normalize Casing

## Importing Libraries
Let’s get Pandas and NumPy up and running in your Python script.

INPUT:

import pandas as pd
import numpy as np

## Input Customer Feedback Dataset
Next, we ask Pandas to read a feedback dataset. Let’s see what that looks like.

INPUT:

data = pd.read_csv('feedback.csv')

Here, “feedback.csv” stands in for whichever dataset you want to examine, and the “pd.read_csv” call tells us we are using the Pandas library to read it into a DataFrame named “data”.
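If you don’t have a feedback file at hand, the loading step can be tried on a small in-memory sample. This is only a sketch: the column names are the ones used later in this guide, but the rows are invented, and the real “feedback.csv” may look different.

```python
import io

import pandas as pd

# Hypothetical sample standing in for 'feedback.csv'; the values are invented.
csv_text = """Review ID,Customer Name,Review Title,Review,Rating,Date
1,alice smith,GREAT Product,,5,2023-01-05
2,bob jones,Too Slow,Arrived late,2,2023-01-07
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (number of rows, number of columns)
print(data.dtypes)  # type Pandas inferred for each column
print(data.head())  # first rows, for a quick visual check
```

Checking the shape and the inferred types right after loading is a cheap way to catch a wrong delimiter or a numeric column that was read as text.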

## Locate Missing Data
Next, we are going to use a not-so-secret Pandas method called ‘isnull’. It is a common method that helps us find where in our dataset there are missing values. This is useful information, as missing values are exactly what we need to correct while data cleaning.

INPUT:

data.isnull()

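The result is a table of the same shape as our dataset, where each cell is True if the value is missing and False otherwise. So, for example, datapoint 1 has missing data in its Review column and its Review ID column (both are marked True).

We can also count the missing values of each feature (column) by coding: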

INPUT: 

data.isnull().sum()
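To make the two calls above concrete, here is a minimal sketch on a made-up DataFrame. The column names follow this guide, but the values are assumptions chosen so that datapoint 1 is missing both its Review ID and its Review.

```python
import numpy as np
import pandas as pd

# Hypothetical data; np.nan and None both count as missing in Pandas.
data = pd.DataFrame({
    "Review ID": [101, np.nan, 103, 104],
    "Review": ["Good", None, "Bad", None],
    "Rating": [5, 4, 1, 3],
})

missing_counts = data.isnull().sum()   # missing values per column
missing_share = data.isnull().mean()   # fraction of each column that is missing
rows_with_gaps = data[data.isnull().any(axis=1)]  # rows with at least one gap

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

The fraction per column is often more telling than the raw count, since it shows at a glance whether a feature is mostly empty.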

From here, we use code to actually clean the data. This boils down to two basic options: 1) drop the data, or 2) fill in the missing data. If you opt to:

### Drop the data
You’ll have to make another decision: whether to drop only the missing values and keep the rest of the data in the set, or to eliminate the feature (the entire column) wholesale because there are so many missing datapoints that it isn’t fit for analysis.

If you want to drop the missing values, you’ll have to go in and mark them void according to Pandas or NumPy standards (see the section below). But if you want to drop the entire column, here’s the code:

remove = ['Review ID', 'Date']
data.drop(columns=remove, inplace=True)
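How many missing values count as “so many” is a judgment call. The sketch below picks the columns to drop from the share of missing values instead of naming them by hand. The 0.5 cutoff and the sample values are assumptions for illustration, and the row-wise `dropna` at the end is an extra option that this guide does not otherwise cover.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Review ID' is mostly empty, 'Review' has a single gap.
data = pd.DataFrame({
    "Review ID": [np.nan, np.nan, np.nan, 104],
    "Review": ["Good", None, "Bad", "Fine"],
    "Rating": [5, 4, 1, 3],
})

# Drop every column where more than half of the values are missing.
# The 0.5 cutoff is an arbitrary choice for this example.
missing_share = data.isnull().mean()
sparse_columns = missing_share[missing_share > 0.5].index
without_sparse = data.drop(columns=sparse_columns)

# Alternatively, keep the columns and drop only the rows lacking a review.
rows_kept = without_sparse.dropna(subset=["Review"])

print(list(without_sparse.columns))
print(rows_kept)
```

Both calls return new DataFrames here rather than using `inplace=True`, which keeps the original data available for comparison.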
Now, let’s examine our other option.

### Input missing data

Technically, the method described above of marking individual values void by Pandas or NumPy standards is also a form of inputting missing data. Here, instead of leaving the value void, we fill it with a placeholder, ‘No review’. When it comes to inputting missing data, you can either add ‘No review’ using the code below, or manually fill in the correct data.

data['Review'] = data['Review'].fillna('No review')

As you can see, datapoint 1 has now been marked as ‘No review’. Success!
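The same `fillna` call can fill several columns at once, each with its own replacement. The sketch below is one possible approach, not the only one: the sample values are invented, and filling a missing Rating with the column median is an assumption that may not suit every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing review and one missing rating.
data = pd.DataFrame({
    "Review": ["Good", None, "Bad"],
    "Rating": [5.0, np.nan, 1.0],
})

# Passing a dict to fillna fills each named column with its own value.
fill_values = {
    "Review": "No review",
    "Rating": data["Rating"].median(),  # assumption: the median is a fair stand-in
}
filled = data.fillna(fill_values)

print(filled)
```

A text placeholder keeps the row usable without pretending to know what the customer wrote, while a numeric fill changes the statistics of the column, so it is worth noting which values were filled.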

## Check for Duplicates

Duplicates, like missing data, cause problems and clog up analytics software. Let’s locate and eliminate them.

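To locate the duplicates we start out with, we run:

data.duplicated()

Each row comes back marked True if it repeats an earlier row. To eliminate those rows, we keep the result of:

data = data.drop_duplicates()

And there we have it, our dataset with our duplicates removed. Onwards.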
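Here is a minimal sketch of both calls on a made-up DataFrame (the rows are assumptions). It also shows the `subset` and `keep` parameters of `drop_duplicates`, which control what counts as a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact repeat of row 0.
data = pd.DataFrame({
    "Customer Name": ["Alice", "Bob", "Alice", "Cara"],
    "Review": ["Good", "Bad", "Good", "Good"],
})

duplicate_mask = data.duplicated()  # True for rows that repeat an earlier row
print(duplicate_mask.sum())         # number of duplicate rows

# drop_duplicates returns a new DataFrame, so the result must be assigned.
deduped = data.drop_duplicates().reset_index(drop=True)

# subset= compares only the named columns; keep= picks which copy survives.
one_per_review = data.drop_duplicates(subset=["Review"], keep="last")

print(deduped)
print(one_per_review)
```

Calling `drop_duplicates()` without assigning its result (or passing `inplace=True`) leaves the original DataFrame unchanged, which is an easy mistake to miss.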

## Detect Outliers

Outliers are numerical values that lie significantly outside of the statistical norm. Put plainly, they are data points so far out of range that they are likely misreads.

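To check for them, we first summarize the Rating column and look at its minimum and maximum:

data['Rating'].describe()

If the summary shows a value far outside the expected range, we can overwrite it by row label. Here, the rating in row 10 is replaced with 1:

data.loc[10,'Rating'] = 1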
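Reading the summary by eye works for small datasets. For larger ones, a common rule of thumb is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to invented ratings. Note that the IQR rule is a substitute technique brought in for illustration, not the method used above, and the 1.5 multiplier is a convention rather than a law.

```python
import pandas as pd

# Hypothetical ratings; the last value (row 10) looks like a misread.
data = pd.DataFrame({"Rating": [4, 5, 3, 4, 5, 4, 3, 5, 4, 4, 50]})

q1 = data["Rating"].quantile(0.25)
q3 = data["Rating"].quantile(0.75)
iqr = q3 - q1

# Anything outside these fences is flagged as a possible outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data["Rating"] < lower) | (data["Rating"] > upper)]

print(lower, upper)
print(outliers)
```

The flagged rows are candidates for review, not automatic deletions. Once you know the correct value, it can be written back with `data.loc[row_label, 'Rating']`.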


## Normalize Casing

Last but not least, we are going to dot our i’s and cross our t’s. 
Meaning we are going to standardize (lowercase) all review titles so as not to confuse our algorithms, and we are going to capitalize Customer Names so that every name follows one consistent format (you’ll see this in action below).

Here’s how to make every review title lowercase:

data['Review Title'] = data['Review Title'].str.lower()

Here’s how to ensure Customer Name capitalization:

data['Customer Name'] = data['Customer Name'].str.title()
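Both casing steps can be tried on a small made-up sample. The values below are assumptions, and the extra `str.strip()` call is an addition that assumes leading or trailing spaces in titles are unintentional.

```python
import pandas as pd

# Hypothetical data with inconsistent casing, stray spaces and a missing title.
data = pd.DataFrame({
    "Review Title": ["  GREAT Product", "too SLOW ", None],
    "Customer Name": ["alice smith", "BOB JONES", "cara diaz"],
})

# Strip surrounding whitespace, then lowercase every title.
data["Review Title"] = data["Review Title"].str.strip().str.lower()

# Capitalize the first letter of each word in the customer names.
data["Customer Name"] = data["Customer Name"].str.title()

print(data)
```

The `.str` methods skip missing values instead of raising an error, so the missing title stays missing and can be handled in the missing-data step.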

## Takeaways

Staying ahead of the competition when it comes to data analysis isn’t easy. It seems like more powerful software and new functionality are being developed and launched every day.