# Cleaning Data

We can use pandas to clean our data so that it is ready for analysis. We will deal with three types of data issues:
1. Missing Data
2. Duplicate Data
3. Incorrect Data

To take care of these issues, we can use built-in pandas functions to clean the data efficiently. Before cleaning, we can call the `info()` function to get a snapshot of the data types in our DataFrame, and then systematically convert each column to its proper data type.

We can start by noting that the `Date` column has the `object` type, whereas the correct type for a date is `datetime`. We can correct this with the following code.

In [None]:
# To change the DataFrame, the converted column must be assigned back so that it overwrites the original.
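
As a minimal sketch of this conversion, the example below builds a small hypothetical DataFrame (the `Sales` column and all values are made up for illustration, since the notebook's real data is not shown here) and converts `Date` with `pd.to_datetime()`.

```python
import pandas as pd

# Hypothetical sample data standing in for the notebook's DataFrame.
df = pd.DataFrame({
    "Date": ["2021-01-05", "2021-01-06", "2021-01-07"],
    "Sales": [100, 250, 175],
})

df.info()  # 'Date' is listed as a text column before the conversion

# Assign the converted column back so the change is kept in the DataFrame.
df["Date"] = pd.to_datetime(df["Date"])

df.info()  # 'Date' is now listed with a datetime64 dtype
```

Calling `pd.to_datetime(df["Date"])` without the assignment would only return a converted copy and leave `df` unchanged.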


## Missing Data

From the output of `info()`, we can also see that there are null values in our data by looking at the Non-Null Count column. Some columns have a lower non-null count than others, which means that those columns contain more null values. To see in greater detail which rows and columns have null values, we can use the `isna()` function.

This output can be difficult to interpret and act on. A way to simplify it is to pair `isna()` with the `sum()` function, which gives the number of null values in each column.
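
The difference between the two outputs can be sketched on a small hypothetical DataFrame (the column names and values below are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing values.
df = pd.DataFrame({
    "ProductID": ["A1", "A2", None, "A4"],
    "Quantity": [5, np.nan, np.nan, 2],
    "Price": [9.99, 4.50, 3.25, 7.00],
})

null_mask = df.isna()          # same shape as df, True where a value is missing
null_counts = df.isna().sum()  # one count of missing values per column

print(null_mask)
print(null_counts)
```

The mask grows with the DataFrame, while the summed version always has one entry per column, which is why it is easier to read on large data.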

We will look at two different ways to deal with missing data:

 1. Removing null values from DataFrame rows and/or columns
 2. Filling null values with a constant or other values from the DataFrame

### Removing null values

To remove rows or columns that contain null values, we can use the `dropna()` function. We can start by dropping every row that contains at least one null value. This reduces our DataFrame from 1887 rows to 1684 rows.
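
A minimal sketch of dropping rows, using hypothetical sample data rather than the 1887-row DataFrame described above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; two of the four rows contain a null value.
df = pd.DataFrame({
    "ProductID": ["A1", "A2", "A3", "A4"],
    "Quantity": [5, np.nan, 3, 2],
    "Price": [9.99, 4.50, np.nan, 7.00],
})

# dropna() returns a new DataFrame; df itself is left unchanged.
rows_dropped = df.dropna()

print(df.shape)            # shape before dropping
print(rows_dropped.shape)  # shape after dropping rows with any null value
```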

We can confirm that rows have been dropped because the row count is lower than in the initial DataFrame. The same function can drop columns that contain null values, instead of rows, if we specify the `axis` argument.
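
Dropping columns instead of rows might look like the sketch below, again on hypothetical sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only 'ProductID' has no null values.
df = pd.DataFrame({
    "ProductID": ["A1", "A2", "A3", "A4"],
    "Quantity": [5, np.nan, 3, 2],
    "Price": [9.99, 4.50, np.nan, 7.00],
})

# axis=1 tells dropna() to remove columns rather than rows.
columns_dropped = df.dropna(axis=1)

print(columns_dropped.columns.tolist())
```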



We see that the columns that have any null values have been dropped. 

### Filling null values

Completely dropping rows and columns that contain null values is often not a good way to clean data, however, because a lot of data can be lost.

A more common approach is to fill the null values in using the `fillna()` function. Two popular uses of the function are:

 1. Fill the null values with 0 or any value
 2. Fill the null values based on the next/previous number
 
With these methods, we can still retain valuable data and not drop the entire column/row.

In [None]:
# filling NaNs with 0


In [None]:
# filling NaNs with the previous non-NaN value ("forward fill")


In [None]:
# filling NaNs with the next non-NaN value ("back fill")
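
The three fills described in the cells above can be sketched together on a hypothetical single-column DataFrame. One assumption to note: the `method` argument of `fillna()` is deprecated in recent pandas releases, so this sketch uses the dedicated `ffill()` and `bfill()` methods for the forward and back fills.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in the middle and at the end.
df = pd.DataFrame({"Quantity": [5.0, np.nan, np.nan, 2.0, np.nan]})

filled_zero = df.fillna(0)    # every NaN becomes 0
filled_forward = df.ffill()   # each NaN takes the previous non-NaN value
filled_backward = df.bfill()  # each NaN takes the next non-NaN value

print(filled_zero)
print(filled_forward)
print(filled_backward)
```

A back fill leaves a trailing NaN in place because there is no later value to copy, and a forward fill does the same for a leading NaN.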


## Duplicate Data

Another data issue that we will solve in this example is duplicate rows in the DataFrame. To identify the distinct values in a particular column, we can use the `unique()` function, which returns an array of all the unique values in that column.

Rather than go through each column in the DataFrame to see all the unique values, we can return a count of unique values for each column in the DataFrame using the `nunique()` function.
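
As a small illustration of both functions, assuming a hypothetical DataFrame with `ProductID` and `Region` columns:

```python
import pandas as pd

# Hypothetical sample data with repeated values in both columns.
df = pd.DataFrame({
    "ProductID": ["A1", "A2", "A1", "A3"],
    "Region": ["East", "East", "West", "East"],
})

product_ids = df["ProductID"].unique()  # distinct values, in order of first appearance
unique_counts = df.nunique()            # number of distinct values per column

print(product_ids)
print(unique_counts)
```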

Duplicate rows can be removed with the `drop_duplicates()` function. It can be a very efficient way to clean data, especially if no data is supposed to be duplicated. Be careful when using it, however, as some data is duplicated on purpose, depending on the needs of the business or situation.

In [None]:
# by default, a row is dropped only if it duplicates an earlier row in every column


In [None]:
# Dropping rows based on specific column values
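
The two cells above could be filled in along these lines. The data is hypothetical, and the choice of `ProductID` as the `subset` column is only an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data: row 1 repeats row 0 exactly,
# while row 2 repeats only the ProductID.
df = pd.DataFrame({
    "ProductID": ["A1", "A1", "A1", "A2"],
    "Region": ["East", "East", "West", "East"],
})

# Default: compare all columns, keep the first occurrence of each full row.
full_row_dedup = df.drop_duplicates()

# subset: compare only the listed columns, keep the first row per ProductID.
by_product = df.drop_duplicates(subset=["ProductID"])

print(full_row_dedup)
print(by_product)
```

The `subset` version removes the `West` row for `A1` as well, which shows how easily intentionally repeated data can be lost.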


## Incorrect Data

Another common cleaning task is to identify and correct errors in our dataset. Consider the `ProductID` column in the first few rows of the DataFrame.

In [None]:
dataframe_3.head()

In the `ProductID` column, some values have a suffix starting with `-`, which denotes that the product falls under a special category; values without the suffix are regular products. However, the information that marks an item as special should be in its own column, separate from the Product ID.

To do this, we can use the `str.split()` function to separate each value into the parts before and after a delimiter. The delimiter in this instance is `-`.

In [None]:
# Note: without assigning the result back to columns, no change is made to the actual DataFrame
dataframe_3['ProductID'].str.split('-', expand=True)

In [None]:
# Assigning the result above back to the DataFrame. Column 0 replaces 'ProductID', column 1 replaces 'SpecialID'.
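
One way this assignment might be written is sketched below. The ID values are hypothetical, and the sketch assumes `SpecialID` does not exist yet, in which case the assignment creates it. Passing `n=1` is an extra precaution, not something the notebook specifies: it splits on the first `-` only, so the result never has more than two columns.

```python
import pandas as pd

# Hypothetical sample data: a '-S' suffix marks a special product.
df = pd.DataFrame({"ProductID": ["1001-S", "1002", "1003-S"]})

# expand=True returns a DataFrame with one column per piece of the split.
parts = df["ProductID"].str.split("-", n=1, expand=True)

df["ProductID"] = parts[0]  # the part before the delimiter
df["SpecialID"] = parts[1]  # the part after it; missing for regular products

print(df)
```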


## Export Data

Pandas can also export the data once we have finished cleaning it. We can use the `to_csv()` function to do so. Given only a file name, it writes the file to the current working directory, which is normally the directory that contains this notebook.

In [None]:
dataframe_3.to_csv('export.csv')
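
By default, `to_csv()` also writes the row index as an extra first column. If that is not wanted, a variation such as the following sketch may help. It uses hypothetical data and a temporary directory (an assumption made so that the example does not write into the working directory), and passes `index=False` to leave the index out.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({"ProductID": ["1001", "1002"], "Quantity": [5, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "export.csv")

    # index=False leaves the row index out of the exported file.
    df.to_csv(path, index=False)

    # Read the file back to check what was written.
    with open(path) as f:
        header = f.readline().strip()
    round_trip = pd.read_csv(path, dtype={"ProductID": str})

print(header)
print(round_trip)
```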