# Data Wrangling
## Objectives

*   Handle missing values
*   Correct data format
*   Standardize and normalize data


<h2>What is the purpose of data wrangling?</h2>


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.

<h3>Import data</h3>

<h4>Import pandas</h4> 


In [16]:
import pandas as pd

Use the pandas method <b>read_csv()</b> to load the data from the file "data.csv". The file already contains a header row, so the column names are read from it and the "names" parameter is not needed.


In [14]:
df = pd.read_csv('data.csv')
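As a small illustration of how <b>read_csv()</b> picks up column names, the sketch below reads two in-memory CSV strings: one with a header row and one without. The CSV text, the variable names, and the "headers" list are made up for this example and are not part of the dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file; the first line is a header row.
csv_with_header = "username,rating,comment\nuser_a,5,good\nuser_b,4,ok\n"

# Default behaviour: the first line supplies the column names.
df_auto = pd.read_csv(io.StringIO(csv_with_header))

# The same data without a header row: pass header=None and supply the names.
csv_without_header = "user_a,5,good\nuser_b,4,ok\n"
headers = ["username", "rating", "comment"]
df_named = pd.read_csv(io.StringIO(csv_without_header), header=None, names=headers)

print(df_auto.columns.tolist())
print(df_named.equals(df_auto))
```

Both calls should produce the same dataframe, so "names" is only needed when the file itself has no header row.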

Use the method <b>head()</b> to display the first rows of the dataframe. It returns five rows by default; here we pass 10 to see the first ten.


In [13]:
# To see what the data set looks like, we'll use the head() method.
print(df.head(10))

              username  rating  \
0           3629pkcerq       5   
1              b*****8       5   
2              m*****2       5   
3              n*****u       5   
4              p*****8       5   
5           anhthu8761       5   
6     thuydung19129995       5   
7           lan1291999       5   
8  phanthitragiang2208       5   
9              b*****6       5   

                                             comment  
0  Hình ảnh chỉ mang tc nhận xu, mn nên mua nha h...  
1  Mua ở shop 2 lần rồi , giao đúng số lượng , ch...  
2  Vải quần ổn. Quần như hình \nĐóng gói sp cẩn t...  
3  Vải k mỏng cũng k dày. Vải túi  quần k giống m...  
4  Chất vải khá ổn, giao hàng cũng nhanh,...........  
5  Quần xinh cực luôn ắ mn rẻ mà chất lượng giao ...  
6  quần vải đẹp chất mịn.Đã giặt rồi mặc lên from...  
7  Hình ảnh mang tính chất nhận xu \nQuần đẹp lắm...  
8  Sorry hình ảnh k

These first rows look complete, but the dataset still contains missing values further down, which may hinder our further analysis.

<div>So, how do we identify all those missing values and deal with them?</div> 

<b>How to work with missing data?</b>

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>
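As a rough sketch of the third step, correcting the data format, the toy example below converts a text column to integers. The dataframe "toy" and its values are made up for illustration; the approach shown (<b>pd.to_numeric</b> followed by <b>astype</b>) is one common option, not necessarily the one this dataset needs.

```python
import pandas as pd

# Toy data (made up for illustration): ratings arrive as text, one is unparseable.
toy = pd.DataFrame({
    "username": ["user_a", "user_b", "user_c"],
    "rating": ["5", "4", "n/a"],
})

print(toy.dtypes)  # "rating" is not numeric yet

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
toy["rating"] = pd.to_numeric(toy["rating"], errors="coerce")

# After dropping the unparseable row, the column can be stored as integers.
toy = toy.dropna(subset=["rating"]).reset_index(drop=True)
toy["rating"] = toy["rating"].astype("int64")

print(toy.dtypes)
```

Coercing first and dropping afterwards keeps the conversion from failing on a single bad value.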


<h2 id="identify_handle_missing_values">Identify and handle missing values</h2>

<h3 id="identify_missing_values">Identify missing values</h3>
<h4>Convert blank entries to NaN</h4>
In this dataset, the code below treats a cell that contains only a single space (" ") as missing data.
We replace " " with NaN (Not a Number), the default missing-value marker in pandas and NumPy, for reasons of computational speed and convenience. Here we use the function: 
 <pre>.replace(A, B, inplace = True) </pre>
to replace A by B.


In [19]:
import numpy as np

# replace cells that contain only a single space (" ") with NaN
df.replace(" ", np.nan, inplace=True)
df.head(5)

Unnamed: 0,username,rating,comment
0,3629pkcerq,5,"Hình ảnh chỉ mang tc nhận xu, mn nên mua nha h..."
1,b*****8,5,"Mua ở shop 2 lần rồi , giao đúng số lượng , ch..."
2,m*****2,5,Vải quần ổn. Quần như hình \nĐóng gói sp cẩn t...
3,n*****u,5,Vải k mỏng cũng k dày. Vải túi quần k giống m...
4,p*****8,5,"Chất vải khá ổn, giao hàng cũng nhanh,..........."
...,...,...,...
95,x.changgg,5,"giao nhanh, vải ok, giá rẻ mình, rất đáng mua ..."
96,n*****b,5,"chất lượng sản phẩm tuyệt vời, vải đẹp , giao ..."
97,hothikhanhly1501,5,"Hàng giao nhanh, đẹp, giống hình,, cơ mà size ..."
98,maihan309,5,Eo67 mac size S vẫn hoie rộng ạhhhhh🌀🌀🌀🌀💞💞💞💞💞💞...
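Note that <b>.replace(" ", np.nan)</b> only matches cells that are exactly one space. If blank cells might also be empty strings or longer runs of whitespace (an assumption, not something the dataset confirms), a regular expression can catch all of them. The sketch below compares both approaches on a made-up dataframe; the names "toy", "exact", and "blank" are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with three styles of "empty" cell.
toy = pd.DataFrame({
    "username": ["user_a", " ", "user_c", "user_d"],
    "comment": ["good", "ok", "", "   "],
})

# Exact-match replacement: only cells equal to a single space are changed.
exact = toy.replace(" ", np.nan)

# Regex replacement: cells that are empty or contain only whitespace are changed.
blank = toy.replace(r"^\s*$", np.nan, regex=True)

print(exact.isnull().sum())
print(blank.isnull().sum())
```

Here the calls return new dataframes instead of using "inplace", which keeps the original available for comparison.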


<h4>Evaluating for Missing Data</h4>

After the replacement, missing values are represented as NaN. pandas provides two methods to detect them:

<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
Each method returns a boolean dataframe of the same shape as the original, indicating for every cell whether the value is missing (<b>.isnull()</b>) or present (<b>.notnull()</b>).
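To make the relationship between the two methods concrete, here is a minimal sketch on a made-up dataframe (the name "toy" and its values are illustrative only). It checks that the two masks are element-wise opposites.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with one missing value per column.
toy = pd.DataFrame({
    "username": ["user_a", np.nan, "user_c"],
    "rating": [5, 4, np.nan],
})

missing = toy.isnull()    # True where a value is missing
present = toy.notnull()   # True where a value is present

print(missing)

# The two masks are element-wise complements of each other.
print(missing.equals(~present))
```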


In [33]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,username,rating,comment
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False


"True" means the value is a missing value while "False" means the value is not a missing value.


<h4>Count missing values in each column</h4>
<p>
Using a for loop in Python, we can quickly count the missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts how often "True" and "False" occur in the column, so the "True" count is the number of missing values. If a column has no missing values, no "True" line is printed.
</p>


In [34]:
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print()

username
False    2559
True        8
Name: username, dtype: int64

rating
False    2567
Name: rating, dtype: int64

comment
False    2567
Name: comment, dtype: int64
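The same per-column count can also be obtained without a loop. The sketch below, on a made-up dataframe named "toy", sums the boolean mask directly; it is offered as an alternative, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "username": ["user_a", np.nan, "user_c", "user_d"],
    "rating": [5, 5, 4, 5],
    "comment": ["good", np.nan, np.nan, "ok"],
})

# True counts as 1 and False as 0, so summing the mask counts missing values.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The mean of the mask gives the fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

Unlike ".value_counts()", this reports an explicit 0 for columns without missing values.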



In [30]:
# simply drop every row with NaN in the "comment" column
df.dropna(subset=["comment"], inplace=True)

# reset the index, because we dropped rows
df.reset_index(drop=True, inplace=True)

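The counts below were taken before the drop: 453 of the 3020 rows had no comment, which leaves 2567 rows afterwards.

username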
False    3011
True        9
Name: username, dtype: int64

rating
False    3020
Name: rating, dtype: int64

comment
False    2567
True      453
Name: comment, dtype: int64
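To show what dropping rows does to the index, here is a minimal sketch on a made-up dataframe. The names "toy", "kept", and "cleaned" are hypothetical, and the sketch returns new dataframes instead of using "inplace" so that each stage can be inspected.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "username": ["user_a", "user_b", np.nan, "user_d"],
    "rating": [5, 4, 5, 3],
    "comment": ["good", np.nan, "ok", np.nan],
})

# Drop only the rows whose "comment" is missing; gaps in other columns are kept.
kept = toy.dropna(subset=["comment"])
print(kept.index.tolist())  # the original row labels survive the drop

# drop=True discards the old labels instead of adding them as a new column.
cleaned = kept.reset_index(drop=True)
print(cleaned.index.tolist())
print(len(toy) - len(cleaned), "rows dropped")
```

The row with a missing username stays in the result, because "subset" restricts the check to the "comment" column.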



In [35]:
df.head()

Unnamed: 0,username,rating,comment
0,3629pkcerq,5,"Hình ảnh chỉ mang tc nhận xu, mn nên mua nha h..."
1,b*****8,5,"Mua ở shop 2 lần rồi , giao đúng số lượng , ch..."
2,m*****2,5,Vải quần ổn. Quần như hình \nĐóng gói sp cẩn t...
3,n*****u,5,Vải k mỏng cũng k dày. Vải túi quần k giống m...
4,p*****8,5,"Chất vải khá ổn, giao hàng cũng nhanh,..........."
