# Data cleaning and preparation
<font size = 2>
    See all the steps of the data cleaning process in the slides posted on Katie, in the "box" for weeks 5 & 6. <br>
    Here, <i>we will learn the <u>methods provided in pandas to support data cleaning and preparation</u>.</i> 
<ol>
<li>Handle missing data</li>
    <ul>
    <li>Filtering out missing data: #1 <b>DataFrame.dropna()</b></li>
    <li>Filling in missing data:    #2 <b>DataFrame.fillna()</b></li>
    </ul>
<li>Data transformation</li>
    <ul>
    <li>Remove duplicates: #3 <b>(DataFrame/Series/Index).duplicated()</b> and <b>(DataFrame/Series).drop_duplicates()</b></li>
    <li>Replace values with new ones: #4 <b>(DataFrame/Series).replace()</b></li>
    <li>#5 The <b>Series.map() </b>function</li>
    <li>#6 The <b>pandas.get_dummies()</b> function</li>
    <li>#7 The <b>Series.str</b> attribute</li>
    </ul>
</ol>
</font>


In [8]:
import pandas as pd
import numpy as np

In [15]:
#1 DataFrame.dropna()

#---drop rows with at least one missing value.
#---inplace = True  => the dataframe itself is modified: rows with missing value(s) are removed
#---inplace = False => the dataframe is not changed; the method returns 
#---                   a new dataframe without the rows that have missing values.

data = pd.DataFrame([[1., 6.5, 3.], 
                     [1., np.nan, np.nan],
                     [np.nan, None, np.nan], 
                     [None, 6.5, 3.]])

df = data.dropna(inplace = False)
df.head()

data.dropna(inplace = True)
data.head()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
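
The difference between the two `inplace` settings is easiest to see by comparing shapes before and after the call. The following is a minimal sketch on the same kind of data; the variable name `cleaned` is only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.]])

# inplace = False (the default): data is left alone, a new dataframe is returned
cleaned = data.dropna()
print(data.shape)      # (4, 3)
print(cleaned.shape)   # (1, 3)

# inplace = True: data itself loses the rows with missing values
data.dropna(inplace = True)
print(data.shape)      # (1, 3)
```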


In [12]:
#1 DataFrame.dropna()
#--- delete column(s) with missing value(s)
#--- same as deleting row(s) with missing value(s), but we need to set axis = 'columns'
data = pd.DataFrame([[1., 6.5, 3.], 
                     [1., np.nan, np.nan],
                     [2, np.nan, np.nan], 
                     [2, 6.5, 3.]])

df = data.dropna(inplace = False, axis = 'columns')
df.head()



Unnamed: 0,0
0,1.0
1,1.0
2,2.0
3,2.0


In [16]:
#1 DataFrame.dropna()
#--- when you want to delete only the rows/columns in which all values are missing
#--- set how = 'all'
data = pd.DataFrame([[1., 6.5, 3.], 
                     [1., np.nan, np.nan],
                     [None, np.nan, np.nan], 
                     [np.nan, 6.5, 3.]])
df = data.dropna(inplace = False, axis = 'rows', how = 'all')
df.head()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [18]:
#1 DataFrame.dropna()
#--- when you want to keep only row(s)/column(s) with at least n non-missing values
#--- use thresh = n. With 3 columns, thresh = 2 removes rows with more than 1 missing value

data = pd.DataFrame([[1., 6.5, 3.], 
                     [1., np.nan, np.nan],
                     [None, np.nan, np.nan], 
                     [np.nan, 6.5, 3.]])
df = data.dropna(inplace = False, axis = 'rows', thresh = 2)
df.head()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
3,,6.5,3.0
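
Because `thresh` counts non-missing values rather than missing ones, it can help to compute that count explicitly. The sketch below, on the same data, compares `dropna(thresh = 2)` with a manual filter; the names `non_missing`, `by_thresh` and `by_mask` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.]])

# number of non-missing values in each row: this is what thresh is compared to
non_missing = data.notna().sum(axis = 'columns')
print(non_missing.tolist())        # [3, 1, 0, 2]

# thresh = 2 keeps the rows with at least 2 non-missing values ...
by_thresh = data.dropna(thresh = 2)

# ... which is the same as filtering on the count directly
by_mask = data[non_missing >= 2]

print(by_thresh.equals(by_mask))   # True
print(by_thresh.index.tolist())    # [0, 3]
```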


In [57]:
#2 DataFrame.fillna()
#---fill missing values with a constant

data = pd.DataFrame([[1., 6.5, 3.], 
                     [1., np.nan, np.nan],
                     [None, np.nan, np.nan], 
                     [np.nan, 6.5, 3.]])
df = data.fillna(111, inplace = False)
df.head()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,111.0,111.0
2,111.0,111.0,111.0
3,111.0,6.5,3.0


In [58]:
#2 DataFrame.fillna()
#---fill missing values with a dictionary: this way, we can fill missing values 
#---for different columns with different values


data = pd.DataFrame(data = [[1., 6.5, 3., 'a', 'b'], 
                     [1., np.nan, np.nan, 'a', 'b'],
                     [None, np.nan, np.nan, None, 'bb'], 
                     [np.nan, 6.5, 3., 'aaa', None]], 
                   columns = ['one', 'two', 'three', 'AA', 'BB'])
df = data.fillna({'one': 111, 'two': 222, 'three': 333, 'AA': 'allA', 'BB': 'allB' }, inplace = False)
df.head()


Unnamed: 0,one,two,three,AA,BB
0,1.0,6.5,3.0,a,b
1,1.0,222.0,333.0,a,b
2,111.0,222.0,333.0,allA,bb
3,111.0,6.5,3.0,aaa,allB


In [60]:
#2 DataFrame.fillna()
#--- fill missing values with neighbors (previous or next):
#--- ffill (forward) uses the previous neighbor, bfill (backward) uses the next neighbor,
#--- and limit = n fills at most n consecutive missing values
#---here the direction is column-based (values come from the rows above/below)
#---note: fillna(method = 'ffill'/'bfill') is deprecated in recent pandas versions;
#---DataFrame.ffill() / DataFrame.bfill() do the same job

data = pd.DataFrame(data = [[1., 6.5, 3., None, 'b'], 
                     [1., np.nan, np.nan, None, 'b'],
                     [None, np.nan, np.nan, None, 'bb'], 
                     [np.nan, 6.5, 3., 'aaa', None]], 
                   columns = ['one', 'two', 'three', 'AA', 'BB'])

df = data.bfill(limit = 2)
df.head()

#df = data.bfill(limit = 1)
#df.head()



Unnamed: 0,one,two,three,AA,BB
0,1.0,6.5,3.0,,b
1,1.0,6.5,3.0,aaa,b
2,,6.5,3.0,aaa,bb
3,,6.5,3.0,aaa,
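
To see how the fill direction and `limit` interact, here is a small sketch on a made-up two-column numeric dataframe (the data and the names `forward` and `backward` are assumptions for illustration, not part of the cell above).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'one': [1., np.nan, np.nan, np.nan],
                     'two': [np.nan, 6.5, np.nan, 3.]})

# forward fill: copy the previous value down, at most 2 consecutive times
forward = data.ffill(limit = 2)
print(forward['one'].isna().tolist())   # [False, False, False, True]
print(forward['two'].isna().tolist())   # [True, False, False, False]

# backward fill: copy the next value up, at most 1 time
backward = data.bfill(limit = 1)
print(backward['one'].isna().tolist())  # [False, True, True, True]
print(backward['two'].tolist())         # [6.5, 6.5, 3.0, 3.0]
```

Note that forward fill cannot fill a missing value at the top of a column, and backward fill cannot fill one at the bottom, since there is no neighbor to copy from.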


In [None]:
data = pd.DataFrame(data = [[1., 6.5, 3., 'a', 'b'], 
                     [1., np.nan, np.nan, 'a', 'b'],
                     [None, np.nan, np.nan, None, 'bb'], 
                     [np.nan, 6.5, 3., 'aaa', None]], 
                   columns = ['one', 'two', 'three', 'AA', 'BB'])
#your turn:

#---filling missing values in each column with [1] the mean of that column 
#---if the column contains numbers or [2] the mode if the column contains text 
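
One possible approach to the exercise above, offered as a sketch rather than the expected answer: build a dictionary of fill values column by column, then pass it to `fillna()`. The helper name `fill_values` is illustrative, and the dataframe is rebuilt here from columns so the block runs on its own.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

data = pd.DataFrame({'one':   [1., 1., np.nan, np.nan],
                     'two':   [6.5, np.nan, np.nan, 6.5],
                     'three': [3., np.nan, np.nan, 3.],
                     'AA':    ['a', 'a', None, 'aaa'],
                     'BB':    ['b', 'b', 'bb', None]})

fill_values = {}
for col in data.columns:
    if is_numeric_dtype(data[col]):
        # [1] numeric column: use the mean (missing values are skipped)
        fill_values[col] = data[col].mean()
    else:
        # [2] text column: use the mode; mode() returns a Series, take its first entry
        fill_values[col] = data[col].mode().iloc[0]

df = data.fillna(fill_values)
print(df)
```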

In [62]:
#3 (DataFrame/Series/Index).duplicated() and (DataFrame/Series).drop_duplicates()
#---DataFrame.duplicated() returns a Boolean series indicating whether each row is a duplicate
#---it works the same way for Series and Index

#---DataFrame.drop_duplicates() returns a df without duplicate rows
#---it works the same way for Series


data = pd.DataFrame(data = [[1, 2, 3], 
                            [1, 2, 3],
                            [4, 5, 6]],
                   columns = ['one', 'two', 'three'])
data.duplicated()

df = data.drop_duplicates()
df.head()


Unnamed: 0,one,two,three
0,1,2,3
2,4,5,6


In [63]:
#3 (DataFrame/Series/Index).duplicated() and (DataFrame/Series).drop_duplicates()
#--- when you want to look for duplicates in only some of the columns, pass subset
#--- pass keep = 'first'/'last' to keep the first or last row in each group of duplicated rows
data = pd.DataFrame(data = [[1, 2, 3], 
                            [1, 2, 111],
                            [4, 5, 6]],
                   columns = ['one', 'two', 'three'])
data.duplicated(subset = ['one', 'two'])

df = data.drop_duplicates(subset = ['two'], keep = 'last')
df.head()

Unnamed: 0,one,two,three
1,1,2,111
2,4,5,6
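
Besides `'first'` and `'last'`, `keep` also accepts `False`, which treats every member of a duplicate group as a duplicate. A short sketch on the same data (the names `mask`, `dupes` and `unique_only` are illustrative):

```python
import pandas as pd

data = pd.DataFrame(data = [[1, 2, 3],
                            [1, 2, 111],
                            [4, 5, 6]],
                   columns = ['one', 'two', 'three'])

# keep = False flags all rows that share the same ('one', 'two') pair
mask = data.duplicated(subset = ['one', 'two'], keep = False)
print(mask.tolist())                 # [True, True, False]

# show every duplicated row, e.g. to inspect them before deciding what to drop
dupes = data[mask]
print(dupes.index.tolist())          # [0, 1]

# drop every row that has a duplicate, keeping only rows that were unique
unique_only = data.drop_duplicates(subset = ['one', 'two'], keep = False)
print(unique_only.index.tolist())    # [2]
```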


In [64]:
#4 (DataFrame/Series).replace()

data = pd.DataFrame(data = [[1, 2, 3], 
                            [1, 2, 3],
                            [2, 5, 6]],
                   columns = ['one', 'two', 'three'])

#--- replace a scalar with another scalar

df = data.replace(1, 111, inplace = False)
df.head()


#---replace a list of values with a scalar
df = data.replace([1, 2], 1122, inplace = False)
df.head()


#---replace using a dictionary
df = data.replace({1: 111, 2:222}, inplace = False)
df.head()

#your turn: replace 2 in the 'two' column with 22  
df = data.copy()
df['two'] = data['two'].replace({2: 22})
df.head()

Unnamed: 0,one,two,three
0,1,22,3
1,1,22,3
2,2,5,6
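
The column-specific replacement can also be written in a single call by passing a nested dictionary of the form `{column: {old: new}}` to `DataFrame.replace()`. A minimal sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame(data = [[1, 2, 3],
                            [1, 2, 3],
                            [2, 5, 6]],
                   columns = ['one', 'two', 'three'])

# only the 'two' column is touched; the 2 in column 'one' stays as it is
df = data.replace({'two': {2: 22}})
print(df['one'].tolist())   # [1, 1, 2]
print(df['two'].tolist())   # [22, 22, 5]
```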


In [47]:
#5 Series.map()
#--- the map() function takes a dictionary-like object or a function as an input
#--- and applies it to each value of the series to produce the output
#--- with a dictionary, values that are not among its keys become NaN
#---=> map() is a good choice for replacing values.
#--- more for map(), apply(), and applymap() https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff 

def f1(x):
    if x % 2 == 0:
        return x * 200
    else:
        return -1
    
data = pd.DataFrame(data = [[1, 2, 3], 
                            [1, 2, 3],
                            [2, 5, 6],
                            [222, 5, 6]],
                   columns = ['one', 'two', 'three'])

out = data['one'].map({1 : 'a', 2: 'aa'})
print(out)

out = data['one'].map(lambda x: x**2)
print(out)

out = data['one'].map(lambda x: x * 100 if x % 2 == 0 else -1)
print(out)

out = data['one'].map(f1)
print(out)

#your turn: add a "sum" column to the "data" dataframe by summing up values in each row 

0      a
1      a
2     aa
3    NaN
Name: one, dtype: object
0        1
1        1
2        4
3    49284
Name: one, dtype: int64
0       -1
1       -1
2      200
3    22200
Name: one, dtype: int64
0       -1
1       -1
2      400
3    44400
Name: one, dtype: int64
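
For the "your turn" task in the cell above, one possible sketch is shown below. It uses `DataFrame.sum()` across columns rather than `map()`, since `map()` works on one Series at a time; treat it as one option, not the intended answer.

```python
import pandas as pd

data = pd.DataFrame(data = [[1, 2, 3],
                            [1, 2, 3],
                            [2, 5, 6],
                            [222, 5, 6]],
                   columns = ['one', 'two', 'three'])

# sum the three original columns row by row and store the result in a new column
data['sum'] = data[['one', 'two', 'three']].sum(axis = 'columns')
print(data['sum'].tolist())   # [6, 6, 13, 233]
```

Selecting the three columns explicitly keeps the result stable if the cell is run twice; otherwise the existing `sum` column would be added into the new total.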


In [65]:
#6 pandas.get_dummies() => returns a dataframe

#---converts a categorical variable into a dummy or indicator matrix. 
#---If a column in a DataFrame has k distinct values, you would derive a matrix or 
#---DataFrame with k columns containing only 1s and 0s
#---in other words, this function produces one-hot vectors from categorical values

data = pd.DataFrame(data = [[1, 2, 'MA'], 
                            [1, 2, 'IA'],
                            [2, 5, 'NY']],
                   columns = ['one', 'two', 'state'])

#---dtype = int gives 0/1 columns (pandas >= 2.0 returns True/False by default)
one_hot = pd.get_dummies(data.state, prefix = data.state.name, dtype = int)
one_hot.head()

df = pd.concat([data,one_hot], join = 'inner', axis = 'columns')
df.head()


Unnamed: 0,one,two,state,state_IA,state_MA,state_NY
0,1,2,MA,0,1,0
1,1,2,IA,1,0,0
2,2,5,NY,0,0,1
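
`pd.get_dummies()` can also take the whole dataframe together with a `columns` list, which encodes those columns and concatenates the result in one step. A small sketch on the same data; unlike the `concat` version above, the original `state` column is replaced by the indicator columns.

```python
import pandas as pd

data = pd.DataFrame(data = [[1, 2, 'MA'],
                            [1, 2, 'IA'],
                            [2, 5, 'NY']],
                   columns = ['one', 'two', 'state'])

# encode only the 'state' column; the other columns pass through unchanged
df = pd.get_dummies(data, columns = ['state'], dtype = int)
print(df.columns.tolist())
# ['one', 'two', 'state_IA', 'state_MA', 'state_NY']
print(df['state_MA'].tolist())   # [1, 0, 0]
```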


In [69]:
#7 Series.str attribute
#---this attribute provides methods to work with strings

my_series = pd.Series(['aa aaa', 'bb bbb bbbb', 'cc ccc', 'dd ddd', 'ee eee eeee'])

#upper = my_series.str.upper()
#print(upper)

#contain_a = my_series.str.contains('a')
#print(contain_a)

split = my_series.str.split(' ')
print(split)

split = split.astype('string')
print(split)


0          [aa, aaa]
1    [bb, bbb, bbbb]
2          [cc, ccc]
3          [dd, ddd]
4    [ee, eee, eeee]
dtype: object
0            ['aa', 'aaa']
1    ['bb', 'bbb', 'bbbb']
2            ['cc', 'ccc']
3            ['dd', 'ddd']
4    ['ee', 'eee', 'eeee']
dtype: string
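
A few more `Series.str` methods, including the two that are commented out in the cell above, are sketched here on a shorter series. The names `upper`, `contain_a`, `n_parts` and `parts` are illustrative.

```python
import pandas as pd

my_series = pd.Series(['aa aaa', 'bb bbb bbbb', 'cc ccc'])

# element-wise upper-casing
upper = my_series.str.upper()
print(upper.tolist())        # ['AA AAA', 'BB BBB BBBB', 'CC CCC']

# Boolean series: does each string contain the letter 'a'?
contain_a = my_series.str.contains('a')
print(contain_a.tolist())    # [True, False, False]

# .str.len() on a series of lists gives the number of pieces after splitting
n_parts = my_series.str.split(' ').str.len()
print(n_parts.tolist())      # [2, 3, 2]

# expand = True spreads the pieces over columns; shorter rows are padded with missing values
parts = my_series.str.split(' ', expand = True)
print(parts.shape)           # (3, 3)
```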


In [None]:
data = pd.DataFrame(data = [[1, 2, 'MA'], 
                            [1, 2, 'IA'],
                            [2, 5, 'NY']],
                   columns = ['one', 'two', 'state'])

#your turn
#---create a dataframe from the "data" dataframe by taking all rows with 
#---state containing 'A'

#show as many ways as you can 
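
A few possible ways to approach this exercise are sketched below. All three build a Boolean mask and use it to select rows; the names `way1`, `way2` and `way3` are illustrative, and other solutions are certainly possible.

```python
import pandas as pd

data = pd.DataFrame(data = [[1, 2, 'MA'],
                            [1, 2, 'IA'],
                            [2, 5, 'NY']],
                   columns = ['one', 'two', 'state'])

# way 1: the Series.str attribute
way1 = data[data['state'].str.contains('A')]

# way 2: Series.map() with a function that tests each value
way2 = data[data['state'].map(lambda s: 'A' in s)]

# way 3: a plain Python list of Booleans passed to .loc
way3 = data.loc[['A' in s for s in data['state']]]

print(way1['state'].tolist())   # ['MA', 'IA']
print(way1.equals(way2) and way2.equals(way3))   # True
```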
