# SIT742: Modern Data Science 
**(Module: Data Manipulation)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)


Prepared by **SIT742 Teaching Team**

---


# Data Wrangling

In this session, we will learn how to use the Python `pandas` package to do data wrangling.



## Content


### 1. `Pandas` Basics

### 2. Loading `CSV` Data

### 3. Data Extraction through Web `API` 

### 4. Web Crawling using `BeautifulSoup`



**Note**: The data available on those services might change over time, so you may need to adjust the code to accommodate those changes.

---

## Part 1. Using `Pandas` to Load `CSV` Data Sets

Here you will learn how to use the Pandas [read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to load a CSV file. Before we start importing our CSV file, it might be good for you to read the [Pandas tutorial on reading CSV files](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table).

To get started, we can import `Pandas` with:

In [None]:
import pandas as pd


If `wget` is not installed in your `Python` environment, install it first:

In [2]:
!pip install wget




Suppose the `csv` data file is available at a URL. We can use `wget` to download it to the local file system.


In [None]:
import wget

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/user_raw1.csv'
DataSet = wget.download(link_to_data)

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/user_raw2.csv'
DataSet = wget.download(link_to_data)



### Importing `CSV` data

Importing `CSV` files with the `Pandas` function [`read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and converting the data into a form Python can understand is simple. It only takes a couple of lines of code. The imported data is stored in a Pandas `DataFrame`.



In [None]:
userdf1 = pd.read_csv("user_raw1.csv")
userdf2 = pd.read_csv("user_raw2.csv")


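# display() shows both previews; a bare expression is rendered only for the last line of a cell
display(userdf1.head())
display(userdf2.head())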
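
As a minimal, self-contained sketch of what `read_csv()` returns, the example below reads a small CSV from an in-memory string instead of a downloaded file. The CSV content and the name `sample_df` are made up for illustration and are not the course data.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a downloaded file
csv_text = """PID,First Name,Surname,Sex
1,Ann,Lee,F
2,Bob,Ray,M
"""

# read_csv() accepts a file path, a URL, or a file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # number of rows and columns
print(sample_df.dtypes)  # data type inferred for each column
print(sample_df.head())  # first rows of the DataFrame
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the header row and the column types were parsed as expected.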



## Part 2. Structuring Data


Change the column names so that they are consistent across the two data sets.





### 2.2 Renaming Column Names as Per Convenience

This step involves renaming columns, because many of the original column names are confusing and hard to understand.



In [None]:
new_name = {'Sex': 'Gender',
           'Addr.': 'Address'}

userdf1.rename(columns= new_name, inplace = True)

new_name = {'Surname': 'Family Name',
           'First Name': 'Given Name',
           'Addr.': 'Address'}

userdf2.rename(columns= new_name, inplace = True)


### 2.3 Replacing the value of the rows if needed

This involves replacing values with more readable ones, such as `M` with `Male`. Please complete the code to replace the state abbreviations with their full names, such as `VIC` with `Victoria`.

In [None]:
replace_values = {'M': 'Male', 'F': 'Female'}

userdf1 = userdf1.replace({'Gender': replace_values})
userdf2 = userdf2.replace({'Gender': replace_values})

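# display() shows both previews; a bare expression is rendered only for the last line of a cell
display(userdf1.head())
display(userdf2.head())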

# Your code to replace values for the column Address
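
One possible approach to this kind of replacement is sketched below on made-up data. It assumes the state abbreviation appears as a separate word inside the address text, which may not match the real `Address` column. The names `toy_df` and `state_names` are hypothetical.

```python
import pandas as pd

# Hypothetical data: assumes the abbreviation appears as a word inside the address
toy_df = pd.DataFrame({'Address': ['1 Main St, VIC', '2 High St, NSW']})

# Word-boundary patterns so that only the standalone abbreviation is replaced
state_names = {r'\bVIC\b': 'Victoria', r'\bNSW\b': 'New South Wales'}

# With regex=True the dictionary keys are treated as regular expressions
toy_df['Address'] = toy_df['Address'].replace(state_names, regex=True)
print(toy_df)
```

If the column holds only the abbreviation itself, the plain dictionary form used for `Gender` above would be enough and no regular expression is needed.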



## Part 3. Data Cleaning

We have seen the selection of 

### 3.1 Removing the Irrelevant Columns

Suppose two irrelevant columns are `PID` and `target`.

In [None]:
to_drop = ['PID', 'target']

userdf1.drop(to_drop, inplace=True, axis = 1)
userdf1.head()

### 3.2 Missing Data

To find and handle missing data in the dataset, we first need to locate the null values. Two common ways to check whether null values are present are:

- Using `isnull()`: This method returns a DataFrame of boolean values with the same shape as the original, where `True` marks a null entry.
- Using `isna().any()`: This method returns one boolean value per column, indicating whether that column contains any null value, rather than a full table.
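
The two checks above, together with dropping or filling the missing entries, can be sketched on a small made-up DataFrame. The data and the names `toy_df`, `dropped_cols`, `dropped_rows` and `filled` are hypothetical, and filling with the column mean is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
toy_df = pd.DataFrame({
    'Name': ['Ann', 'Bob', None],
    'Age': [30, np.nan, 25],
    'Notes': [np.nan, np.nan, np.nan],
})

print(toy_df.isnull())      # element-wise mask, same shape as toy_df
print(toy_df.isna().any())  # one boolean per column
print(toy_df.isna().sum())  # number of missing entries per column

# Drop columns in which every value is missing
dropped_cols = toy_df.dropna(axis=1, how='all')

# Drop rows that still contain at least one missing value
dropped_rows = dropped_cols.dropna(axis=0, how='any')

# Alternatively, fill missing ages with the column mean instead of dropping rows
filled = dropped_cols.fillna({'Age': dropped_cols['Age'].mean()})
print(filled)
```

Dropping rows is simple but discards information, while filling keeps every row at the cost of introducing estimated values.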

In [None]:

# Add code here to drop the rows that contain null values in one DataFrame,
# and the columns that contain null values in the other DataFrame


### 3.3 Data Merging


Merging is the process of combining two datasets into one, lining up rows based on a common property so that they can be analysed together. We can do this by using the `merge()` method of the DataFrame. The following shows the syntax of `merge()` with its default arguments:



In [None]:
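# `right` is the other DataFrame (here userdf2); the remaining arguments are shown with their default values
userdf1.merge(userdf2, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)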
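
To make the difference between joining and simply stacking concrete, here is a minimal sketch on two made-up DataFrames that share a `PID` key. The data and the names `left`, `right`, `inner`, `left_join` and `stacked` are hypothetical. Which option suits the course data depends on whether the two files describe the same users or different ones.

```python
import pandas as pd

# Hypothetical tables that share the key column 'PID'
left = pd.DataFrame({'PID': [1, 2, 3], 'Gender': ['Female', 'Male', 'Male']})
right = pd.DataFrame({'PID': [2, 3, 4], 'Address': ['Victoria', 'Tasmania', 'Queensland']})

# Inner join: keep only the keys present in both tables
inner = left.merge(right, how='inner', on='PID')

# Left join: keep every row of `left`; indicator=True adds a '_merge' column
left_join = left.merge(right, how='left', on='PID', indicator=True)

# Concatenation: stack the rows without matching on any key
stacked = pd.concat([left, right], ignore_index=True)

print(inner)
print(left_join)
print(stacked)
```

A join lines up columns for the same entity, whereas concatenation appends rows and leaves missing values where a column exists in only one table.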

### 3.4 De-Duplicate

De-duplication means removing duplicate rows. Duplicate rows add no information, and they can distort counts and statistics as well as slow down the analysis. To find duplicate rows in the dataset we can use the DataFrame method `duplicated()`. For example:


userdf1.duplicated()

This method returns a boolean value for each row, which is `True` when the row repeats an earlier row.

If a dataset contains duplicate rows, they can be removed using the `drop_duplicates()` method. The following shows its syntax with the default arguments:

In [None]:
# with inplace=False the de-duplicated DataFrame is returned, so assign it back
userdf1 = userdf1.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
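
The effect of `duplicated()` and of the `subset` and `keep` arguments can be seen on a small made-up DataFrame. The data and the names `toy_df`, `dup_mask`, `deduped` and `by_given` are hypothetical.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly
toy_df = pd.DataFrame({
    'Given Name': ['Ann', 'Bob', 'Ann', 'Ann'],
    'Family Name': ['Lee', 'Ray', 'Lee', 'Kim'],
})

# True for each row that repeats an earlier row in all columns
dup_mask = toy_df.duplicated()
print(dup_mask)

# Remove fully duplicated rows, keeping the first occurrence
deduped = toy_df.drop_duplicates(ignore_index=True)
print(deduped)

# Compare on one column only and keep the last occurrence of each value
by_given = toy_df.drop_duplicates(subset=['Given Name'], keep='last', ignore_index=True)
print(by_given)
```

Restricting `subset` to key columns is useful when two records describe the same entity but differ in less important fields.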

## Part 4. Data Transformation


Data transformation is a common practice in machine learning. 


### 4.1 Min-Max Normalization

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as min-max scaling.

The min-max approach rescales each feature to the fixed range [0, 1] by subtracting the minimum value of the feature and then dividing by the range (maximum minus minimum). We can apply min-max scaling in Pandas using the `.min()` and `.max()` methods.


In [None]:
# copy the data (`df` is assumed to be an existing DataFrame, such as userdf1)
df_min_max_scaled = df.copy()

# apply min-max normalization to each numeric column
for column in df_min_max_scaled.select_dtypes(include='number').columns:
    col = df_min_max_scaled[column]
    df_min_max_scaled[column] = (col - col.min()) / (col.max() - col.min())
  

In [None]:
# view normalized data
print(df_min_max_scaled)
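
As a small worked example of the same formula, the sketch below scales a made-up numeric DataFrame in one vectorised expression. The data and the names `toy_df` and `scaled` are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data
toy_df = pd.DataFrame({
    'height': [150.0, 160.0, 170.0],
    'weight': [50.0, 65.0, 80.0],
})

# Subtract each column's minimum and divide by its range
scaled = (toy_df - toy_df.min()) / (toy_df.max() - toy_df.min())
print(scaled)
```

In each column the smallest value maps to 0 and the largest to 1. A column whose values are all equal has a range of zero, so this formula yields `NaN` for it and such columns need separate handling.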

### 4.2 Z-Score Standardization

Standardization is another scaling technique, in which the values are centered around the mean and scaled to unit standard deviation. This means that the mean of the attribute becomes zero and the resulting distribution has a standard deviation of one.

The z-score method (often called standardization) transforms the data into a distribution with a mean of 0 and a standard deviation of 1. Each standardized value is computed by subtracting the mean of the corresponding feature and then dividing by the standard deviation. Note that the Pandas `.std()` method computes the sample standard deviation (`ddof=1`) by default.

In [None]:
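# copy the data (`df` is assumed to be an existing DataFrame, such as userdf1)
df_z_scaled = df.copy()

# apply z-score standardization to each numeric column
for column in df_z_scaled.select_dtypes(include='number').columns:
    df_z_scaled[column] = (df_z_scaled[column] -
                           df_z_scaled[column].mean()) / df_z_scaled[column].std()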
  

In [None]:
# view standardized data
display(df_z_scaled)
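
The sketch below works through the z-score on a made-up column and shows how the choice between sample and population standard deviation changes the result. The data and the names `toy_df`, `z_sample` and `z_population` are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data with mean 4
toy_df = pd.DataFrame({'score': [2.0, 4.0, 6.0]})

# Default in Pandas: sample standard deviation (ddof=1), which is 2 here
z_sample = (toy_df - toy_df.mean()) / toy_df.std()

# Population standard deviation (ddof=0), the convention used by some other libraries
z_population = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

print(z_sample)
print(z_population)
```

Both versions have a mean of zero. They differ only by a constant factor, which matters mainly for small datasets or when comparing results across tools.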

### 4.3 Export Dataset

This is the last step of the data cleaning process. After performing all the above operations, the data has been transformed into a clean dataset and is ready to be exported for the next stage of the data science or data analysis workflow.
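
A minimal sketch of the export step, assuming CSV is the target format, is shown below. The DataFrame `clean_df`, the file name `user_clean.csv` and the temporary directory are hypothetical choices for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data standing in for the result of the steps above
clean_df = pd.DataFrame({
    'Given Name': ['Ann', 'Bob'],
    'Gender': ['Female', 'Male'],
})

# Hypothetical output location; replace with your own path
out_path = os.path.join(tempfile.gettempdir(), 'user_clean.csv')

# index=False avoids writing the row index as an extra column
clean_df.to_csv(out_path, index=False)

# Read the file back to confirm that the export round-trips
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Reading the file back is a cheap check that the column names and values survived the export unchanged.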