# Cleaning Data

## Problem Statement

* A common problem for data scientists is the so-called 80/20 problem: about 80% of the time is spent reading, cleaning, and reorganizing data, and only about 20% is spent on analysis.
* This notebook covers a few techniques for organizing and cleaning data.

* Functions covered: ````drop```` and its ````inplace```` flag, ````set_index````, ````loc````, ````iloc````, ````str.split````, ````replace````, and ````to_excel````

In [1]:
import pandas as pd
import numpy as np
from openpyxl.workbook import Workbook  # openpyxl is the engine pandas uses to write .xlsx files

In [13]:
df_csv = pd.read_csv('Exercise_Files/Names.csv', header=None)
df_csv.columns = ['First','Last', 'Address','City','State','Area Code','Income']
df_csv

,First,Last,Address,City,State,Area Code,Income
0,John,Doe,120 jefferson st.,Riverside,NJ,8074,45000
1,Jack,McGinnis,220 hobo Av.,Phila,PA,9119,18000
2,"John ""Da Man""",Repici,120 Jefferson St.,Riverside,NJ,8075,120000
3,Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD,91234,90000
4,,Blankman,,SomeTown,SD,298,30000
5,"Joan ""Danger"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,123,68000
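
The column labels can also be supplied while reading, through the ````names```` parameter of ````read_csv````. The sketch below is only an illustration: it reads two rows copied from the table above out of an in-memory buffer (a stand-in for `Exercise_Files/Names.csv`, which is not reproduced here).

```python
import io
import pandas as pd

# Two headerless rows shaped like the notebook's file (in-memory stand-in)
raw = io.StringIO(
    'John,Doe,120 jefferson st.,Riverside,NJ,8074,45000\n'
    '"John ""Da Man""",Repici,120 Jefferson St.,Riverside,NJ,8075,120000\n'
)

columns = ['First', 'Last', 'Address', 'City', 'State', 'Area Code', 'Income']

# names= labels the columns at read time, so no separate assignment is needed
df = pd.read_csv(raw, header=None, names=columns)

print(df.columns.tolist())
print(df.loc[1, 'First'])  # doubled quotes in the file decode to single quotes
```

Passing ````names```` keeps the read and the labelling in one step, which avoids a mismatch if the number of columns ever changes.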


### 1. How to drop unnecessary columns
* Imagine there are numerous columns and some of them are irrelevant to what we need. We can simply drop them.
* We can create a variable that holds a list of columns and drop them all at once.
* We can also drop a single column.
* Like most functions in pandas, ````drop```` accepts many flags.

In [14]:
df_csv.drop(columns='Address', inplace=True)
df_csv

,First,Last,City,State,Area Code,Income
0,John,Doe,Riverside,NJ,8074,45000
1,Jack,McGinnis,Phila,PA,9119,18000
2,"John ""Da Man""",Repici,Riverside,NJ,8075,120000
3,Stephen,Tyler,SomeTown,SD,91234,90000
4,,Blankman,SomeTown,SD,298,30000
5,"Joan ""Danger"", Anne",Jet,Desert City,CO,123,68000


* The ````inplace=True```` flag modifies the existing data frame directly instead of returning a new one, so we don't have to assign the result back to ````df_csv````.
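
As a minimal sketch of the two ways to call ````drop````, the example below uses a small hand-built frame (values copied from the table above, not the full file). It drops a list of columns without ````inplace````, then drops a single column with ````inplace=True````. The names ````cols_to_drop```` and ````trimmed```` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'First': ['John', 'Jack'],
    'Last': ['Doe', 'McGinnis'],
    'Address': ['120 jefferson st.', '220 hobo Av.'],
    'City': ['Riverside', 'Phila'],
    'Area Code': [8074, 9119],
    'Income': [45000, 18000],
})

# Passing a list drops several columns at once and returns a new frame
cols_to_drop = ['Address', 'City']
trimmed = df.drop(columns=cols_to_drop)

# The original is untouched because inplace defaults to False
print(df.columns.tolist())
print(trimmed.columns.tolist())

# With inplace=True the frame itself is modified and the call returns None
result = df.drop(columns='Address', inplace=True)
print(result)
print(df.columns.tolist())
```

Because the in-place call returns ````None````, writing `df = df.drop(..., inplace=True)` would overwrite the frame with ````None````. Use either reassignment or ````inplace=True````, not both.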

### 2. How to index columns

* We still want to be able to identify where people live, so we index the data frame by area code.
* Indexing by a column lets us look up rows by an identifying value in a large data set. Lookups are unambiguous only when those values are unique.
* Here we index by the column 'Area Code'.
* Assign the result back to the variable df_csv to apply the change to the data frame.

In [15]:
df_csv = df_csv.set_index('Area Code')
df_csv

Area Code,First,Last,City,State,Income
8074,John,Doe,Riverside,NJ,45000
9119,Jack,McGinnis,Phila,PA,18000
8075,"John ""Da Man""",Repici,Riverside,NJ,120000
91234,Stephen,Tyler,SomeTown,SD,90000
298,,Blankman,SomeTown,SD,30000
123,"Joan ""Danger"", Anne",Jet,Desert City,CO,68000


* Now we can search by index, for example to find all data related to area code 8074.

In [18]:
df_csv.loc[8074]

First          John
Last            Doe
City      Riverside
State            NJ
Income        45000
Name: 8074, dtype: object

In [19]:
df_csv.loc[[8074]]

Area Code,First,Last,City,State,Income
8074,John,Doe,Riverside,NJ,45000
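
The two lookups above differ only in their brackets, yet they return different types. The sketch below illustrates this on a small hand-built frame (a stand-in for the notebook's data). It also checks that the index is unique, since a duplicated label would make ````loc```` return several rows.

```python
import pandas as pd

people = pd.DataFrame({
    'First': ['John', 'Jack', 'John "Da Man"'],
    'Last': ['Doe', 'McGinnis', 'Repici'],
    'Area Code': [8074, 9119, 8075],
}).set_index('Area Code')

# Label lookups are only unambiguous when the index has no duplicates
print(people.index.is_unique)

row_as_series = people.loc[8074]    # scalar label -> Series
row_as_frame = people.loc[[8074]]   # list of labels -> DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
```

The list form is convenient when the result should keep its tabular shape, for example when it is passed on to code that expects a data frame.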


* We can still select rows by integer position, as with an array, using the *integer location* indexer ````iloc````, which returns the exact same row.

In [20]:
df_csv.iloc[0]

First          John
Last            Doe
City      Riverside
State            NJ
Income        45000
Name: 8074, dtype: object
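
To make the difference between position and label concrete, here is a small sketch on a hand-built frame (not the notebook's file). ````iloc[0]```` and ````loc[8074]```` agree only while the row labelled 8074 happens to be first. The reversed sort is an arbitrary choice for the demonstration.

```python
import pandas as pd

people = pd.DataFrame({
    'First': ['John', 'Jack', 'John "Da Man"'],
    'Area Code': [8074, 9119, 8075],
}).set_index('Area Code')

by_position = people.iloc[0]   # first row, whatever its label is
by_label = people.loc[8074]    # row whose index label is 8074
print(by_position.equals(by_label))

# After reordering, position 0 is a different row, but the label still finds John
reordered = people.sort_index(ascending=False)
print(reordered.iloc[0].name)
print(reordered.loc[8074, 'First'])
```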

* For more cleaning, let's take a look at everyone's first name.
* This uses a slice. Because there is no label on the right side of the colon, it selects from index label 8074 through the last row.

In [22]:
df_csv.loc[8074:,'First']

Area Code
8074                    John
9119                    Jack
8075           John "Da Man"
91234                Stephen
298                      NaN
123      Joan "Danger", Anne
Name: First, dtype: object
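
One detail of label slicing is easy to miss: with ````loc````, both end labels are included, unlike ordinary Python slices. The sketch below shows this on a hand-built frame whose values are copied from the output above. The names ````tail```` and ````middle```` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    'First': ['John', 'Jack', 'John "Da Man"', 'Stephen'],
    'Area Code': [8074, 9119, 8075, 91234],
}).set_index('Area Code')

# Open-ended slice: from label 9119 through the last row
tail = people.loc[9119:, 'First']

# Closed slice: both end labels (9119 and 8075) are part of the result
middle = people.loc[9119:8075, 'First']

print(tail.tolist())
print(middle.tolist())
```

The slice follows the order of the rows in the frame, not the numeric order of the labels, which is why 9119 can come before 8075 here.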

* Some of the names contain nicknames and quotes.
* We need to read everyone's first name and use a string function to split it on spaces.
* Select the First column, access its string methods through ````str````, and split it.

In [23]:
df_csv.First.str.split(expand=True)

Area Code,0,1,2
8074,John,,
9119,Jack,,
8075,John,"""Da","Man"""
91234,Stephen,,
298,,,
123,Joan,"""Danger"",",Anne


* We can see that it splits every word of the First column into its own column.
* All that is left is to grab the first column of the split (column ````0````).

In [24]:
# Keep only column 0 of the split, i.e. the first word of each name
df_csv['First'] = df_csv['First'].str.split(expand=True)[0]

In [25]:
df_csv

Area Code,First,Last,City,State,Income
8074,John,Doe,Riverside,NJ,45000
9119,Jack,McGinnis,Phila,PA,18000
8075,John,Repici,Riverside,NJ,120000
91234,Stephen,Tyler,SomeTown,SD,90000
298,,Blankman,SomeTown,SD,30000
123,Joan,Jet,Desert City,CO,68000
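
The first-word extraction can be written in more than one way. The sketch below is an illustration on a hand-built Series, not the notebook's exact cell. It compares taking column ````0```` of the expanded split with taking element ````0```` of each split list through ````str[0]````.

```python
import numpy as np
import pandas as pd

first = pd.Series(
    ['John', 'John "Da Man"', np.nan, 'Joan "Danger", Anne'],
    name='First',
)

# expand=True spreads the pieces over columns; column 0 holds the first word
via_expand = first.str.split(expand=True)[0]

# Without expand, each value becomes a list and .str[0] picks its first item
via_list = first.str.split().str[0]

print(via_expand.tolist())
print(via_list.tolist())
```

Both forms leave the missing name as a missing value, so the NaN still has to be handled separately.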


* The last thing we have to deal with is the missing (NaN) value, which is awkward to work with.

* To do this, locate numpy's NaN value and replace it with a string that we can easily identify.
* Use the data frame's ````replace```` function to search for NaN and replace it with our own string.

In [26]:
df_csv = df_csv.replace(np.nan, 'N/A')
df_csv

Area Code,First,Last,City,State,Income
8074,John,Doe,Riverside,NJ,45000
9119,Jack,McGinnis,Phila,PA,18000
8075,John,Repici,Riverside,NJ,120000
91234,Stephen,Tyler,SomeTown,SD,90000
298,N/A,Blankman,SomeTown,SD,30000
123,Joan,Jet,Desert City,CO,68000
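
````replace```` returns a new data frame, so the result has to be assigned to a name for the change to persist. The sketch below demonstrates this on a small hand-built frame and also shows ````fillna````, which is a more direct way to express the same substitution. The name ````cleaned```` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'First': ['John', np.nan],
    'Last': ['Doe', 'Blankman'],
    'Income': [45000, 30000],
})

# The returned frame is discarded here, so df still contains the NaN
df.replace(np.nan, 'N/A')
print(df['First'].isna().sum())

# Assigning the result keeps the change; fillna targets missing values directly
cleaned = df.fillna('N/A')
print(cleaned['First'].tolist())
print(cleaned['Income'].tolist())
```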


* Now the clean data frame can be saved to an Excel file.

In [27]:
df_csv.to_excel('modified.xlsx')  # returns None, so there is nothing to assign
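
One caveat when the saved file is read back: by default pandas treats the text `N/A` as a missing value, so the marker turns into NaN again. The sketch below shows the effect with an in-memory CSV buffer, which stands in for the Excel file so that the example needs no extra engine. ````read_excel```` accepts the same ````keep_default_na```` parameter.

```python
import io
import pandas as pd

cleaned = pd.DataFrame(
    {'First': ['John', 'N/A'], 'Last': ['Doe', 'Blankman']},
    index=pd.Index([8074, 298], name='Area Code'),
)

# Write to an in-memory CSV buffer (stand-in for the saved file)
buffer = io.StringIO()
cleaned.to_csv(buffer)

# Default parsing turns the 'N/A' marker back into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer, index_col='Area Code')

# keep_default_na=False keeps the marker as a plain string
buffer.seek(0)
literal_read = pd.read_csv(buffer, index_col='Area Code', keep_default_na=False)

print(default_read['First'].isna().tolist())
print(literal_read['First'].tolist())
```

A marker that is not on the default missing-value list, such as `Unknown`, would survive the round trip without any extra parameter.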