# Basic Data Manipulation with Pandas

The following four steps are common when starting a data analysis project.

* Data Exploration
* Data Filtering and Sorting
* Data Cleaning
* Data Transformation

Data Files Used:
* class1_test1.tsv
* class1_test2.tsv
* class2_test1.tsv
* class1_test1_cleaning.tsv

Import `numpy` and `pandas`.

## 1. Data Exploration

The loaded data may be too large to examine in full. We check the following aspects of the data to understand it better.
* Number of rows and columns
* Data types of columns
* View data samples
* Basic statistics of each column
* Basic plotting

### Load CSV File

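Load the tab-separated file `class1_test1.tsv` from the `data` folder. Although the extension is `.tsv`, it can be read with the CSV loader by setting the separator to a tab.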
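A minimal sketch of loading a tab-separated file with `pandas.read_csv()`. The actual contents of `class1_test1.tsv` are not shown here, so an in-memory string with made-up column names (`name`, `english`, `maths`, `science`) and values stands in for the file.

```python
import io

import pandas as pd

# Stand-in for data/class1_test1.tsv; column names and values are hypothetical.
tsv_text = (
    "name\tenglish\tmaths\tscience\n"
    "Alice\t72\t85\t64\n"
    "Bob\t58\t91\t77\n"
    "Cara\t66\t49\t80\n"
)

# A TSV file is read with read_csv() by setting the separator to a tab.
# With the real file this would be: pd.read_csv("data/class1_test1.tsv", sep="\t")
df1 = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df1)
```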

#### Size of Data

The `dataframe.shape` attribute returns the dimensions of the data as a `(rows, columns)` tuple.

#### Sample Data

The `head()` and `tail()` functions return the first and last few rows of the data (5 by default). You can specify the number of rows to display.
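As a quick illustration of `shape`, `head()` and `tail()`, the sketch below uses a small hypothetical marks table in place of the loaded file.

```python
import pandas as pd

# Hypothetical marks table standing in for the loaded file.
df1 = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara", "Dan", "Eve", "Finn"],
    "english": [72, 58, 66, 90, 81, 45],
    "maths": [85, 91, 49, 70, 60, 74],
})

print(df1.shape)    # (number of rows, number of columns)
print(df1.head())   # first 5 rows by default
print(df1.tail(2))  # last 2 rows
```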

#### Columns and Index

A dataframe is like a table with a header. The `columns` attribute gives its column names.

Each row in a dataframe also has an index label, which can be used to fetch that row.
* Index labels do NOT need to be unique.
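The sketch below inspects `columns` and `index` on a small dataframe whose index labels (`s1`, `s2`) are hypothetical and deliberately include a repeated label.

```python
import pandas as pd

df = pd.DataFrame(
    {"english": [72, 58, 66], "maths": [85, 91, 49]},
    index=["s1", "s2", "s2"],  # the label "s2" is deliberately repeated
)

print(df.columns.tolist())  # column names
print(df.index.tolist())    # row index labels
print(df.index.is_unique)   # tells whether any label is repeated

unique_row = df.loc["s1"]     # a unique label returns one row (a Series)
repeated_rows = df.loc["s2"]  # a repeated label returns every matching row
print(repeated_rows)
```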

#### Dataframe Info and Statistics 

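The `dataframe.info()` function prints a summary of the dataframe.
* Each column's name, data type and non-null count, which reveals whether the column contains any null data
* Index type
* Memory usage

The `dataframe.describe()` function returns basic statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column.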
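A short sketch of `info()` and `describe()` on hypothetical data with one missing value, to show how the non-null count exposes it.

```python
import numpy as np
import pandas as pd

# Hypothetical data; one maths mark is missing.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara", "Dan"],
    "english": [72, 58, 66, 90],
    "maths": [85.0, np.nan, 49.0, 70.0],
})

df.info()              # prints column names, non-null counts, dtypes, memory usage
stats = df.describe()  # statistics for the numeric columns only
print(stats)
```

The `count` row of `describe()` excludes nulls, so a column with a lower count than the others has missing values.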

### Load Another CSV File (Exercise)

Load another tab-separated file, `data/class2_test1.tsv`.

Examine the dataframe using `shape`, `head()`, `tail()`, `info()` and `describe()`.

### Concatenate Dataframes

Before concatenating the 2 dataframes, add a column `class` to `df1` with value `class1`.

Add a column `class` to `df2` with value `class2`.

Concatenate the 2 dataframes to get a new dataframe.

Use `shape` to confirm that the 2 dataframes have been concatenated.

The original index values remain the same.

Because index values are now repeated, selecting an index value using `loc[]` may return multiple rows.
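The concatenation steps above can be sketched as follows. The two small dataframes are hypothetical stand-ins for the loaded files.

```python
import pandas as pd

# Hypothetical stand-ins for the two loaded files.
df1 = pd.DataFrame({"name": ["Alice", "Bob"], "maths": [85, 91]})
df2 = pd.DataFrame({"name": ["Eve", "Finn", "Gus"], "maths": [60, 74, 55]})

# Tag each dataframe with its class before combining.
df1["class"] = "class1"
df2["class"] = "class2"

df3 = pd.concat([df1, df2])
print(df3.shape)           # rows of df1 plus rows of df2
print(df3.index.tolist())  # original index values are kept, so 0 and 1 repeat

rows_at_0 = df3.loc[0]     # one row from each original dataframe
print(rows_at_0)
```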

#### Reset Index

After concatenation, each row keeps the index value from its original dataframe, so index values are repeated. The `reset_index()` function replaces them with a new sequential index.

After resetting the index, selection by row index returns only 1 row.

By default, resetting the index also adds the original index to the dataframe as a column.

That column can be dropped in place, which updates the dataframe directly.
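A minimal sketch of resetting the index and then dropping the leftover column, using hypothetical data.

```python
import pandas as pd

# Hypothetical concatenated dataframe with repeated index values 0, 1, 0, 1.
df3 = pd.concat([
    pd.DataFrame({"name": ["Alice", "Bob"], "maths": [85, 91]}),
    pd.DataFrame({"name": ["Eve", "Finn"], "maths": [60, 74]}),
])

# reset_index() builds a new sequential index; the old one becomes column "index".
df3 = df3.reset_index()
columns_after_reset = df3.columns.tolist()

first_row = df3.loc[0]  # now a single row (a Series)
print(first_row)

# inplace=True modifies df3 directly instead of returning a new dataframe.
df3.drop(columns="index", inplace=True)
print(df3)
```

Passing `drop=True` to `reset_index()` skips adding the old index as a column in the first place.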

#### Change Index Column

Set a column as the new index using `set_index()`, then select rows by the new index.
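A sketch of changing the index column. Which column the original notebook promotes to the index is not stated, so a hypothetical `name` column is assumed here.

```python
import pandas as pd

# Hypothetical data; "name" is assumed to be the column used as the new index.
df3 = pd.DataFrame({
    "name": ["Alice", "Bob", "Eve"],
    "maths": [85, 91, 60],
    "class": ["class1", "class1", "class2"],
})

df3 = df3.set_index("name")

eve = df3.loc["Eve"]                       # one row, selected by label
some_maths = df3.loc[["Alice", "Eve"], "maths"]  # several rows, one column
print(eve)
print(some_maths)
```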

### Basic Plotting

Summarized steps to create the `df3` dataframe. 

#### Histogram

A histogram shows how often values occur within fixed intervals (bins).

The parameter `bins` can be used to control the granularity of the chart.

The parameter `bins` can also be a list of bin boundary values.
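To show what the two forms of `bins` mean without drawing a chart, the sketch below computes the counts behind a histogram with `numpy.histogram()`, which accepts `bins` as either a number of intervals or a list of boundaries. The marks are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical marks.
marks = pd.Series([35, 42, 48, 55, 58, 61, 67, 72, 78, 85, 91, 99], name="maths")

# bins as an integer: that many equal-width intervals between min and max.
counts_equal, edges_equal = np.histogram(marks, bins=4)
print(edges_equal)   # interval boundaries
print(counts_equal)  # number of marks in each interval

# bins as a list: explicit interval boundaries.
counts_custom, edges_custom = np.histogram(marks, bins=[0, 50, 60, 70, 80, 100])
print(counts_custom)
```

With matplotlib installed, `marks.plot.hist(bins=4)` or `marks.plot.hist(bins=[0, 50, 60, 70, 80, 100])` draws the corresponding charts.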

## 2. Updating and Sorting

* Selecting row(s) and column(s) using index operator, `loc[]` and `iloc[]`
* Boolean filtering
* Assigning values with indexing
* Sorting

### Indexing and Data Selection

**Indexing** means selecting particular rows and columns of data from a DataFrame.
* Indexing can be used to select an individual row, column or item.
* Indexing can also be used to perform **Subset Selection**.

Pandas uses the indexers `[]`, `.loc[]` and `.iloc[]`.
* `Dataframe[]`: Used for column selection. Also known as the indexing operator.
* `Dataframe.loc[]`: Used for row selection using **labels**. It also accepts an optional column selector after the row selector.
* `Dataframe.iloc[]`: Used for row selection using **positions**. It also accepts an optional column selector after the row selector.
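The three indexers, plus Boolean filtering, can be sketched on a small hypothetical dataframe indexed by student name.

```python
import pandas as pd

# Hypothetical marks indexed by student name.
df = pd.DataFrame(
    {"english": [72, 58, 66], "maths": [85, 91, 49], "science": [64, 77, 80]},
    index=["Alice", "Bob", "Cara"],
)

maths_col = df["maths"]             # single column -> Series
two_cols = df[["english", "maths"]] # list of columns -> DataFrame
row_by_label = df.loc["Bob"]        # row selected by label
row_by_pos = df.iloc[1]             # row selected by 0-based position
item = df.loc["Cara", "science"]    # single item: row label, column label
subset = df.iloc[0:2, 0:2]          # subset: first two rows and columns
passed = df[df["maths"] >= 50]      # Boolean filtering keeps rows where the mask is True
print(passed)
```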

### Updating Values

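A selection from a dataframe may return either a **view** of the original data or a copy, depending on the operation and the pandas version. Changes made through a selection are therefore not guaranteed to affect the original data. To update values reliably, assign through `.loc[]` or `.iloc[]` on the original dataframe.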
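A minimal sketch of assigning values with indexing, written against the original dataframe through `.loc[]`. The data and the cap of 90 are hypothetical.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["Alice", "Bob", "Cara"], "maths": [85, 91, 49]})

# Update a single cell: row label 2, column "maths".
df.loc[2, "maths"] = 50

# Update every row matching a Boolean mask (the cap of 90 is made up).
df.loc[df["maths"] > 90, "maths"] = 90
print(df)
```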

#### Option 2
Create a new column `maths_pass` using the `apply()` function.
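One possible way to build `maths_pass` with `apply()` is sketched below. The pass mark is not given in the text, so `PASS_MARK = 50` is an assumption, and the data is hypothetical.

```python
import pandas as pd

PASS_MARK = 50  # assumed pass mark; not specified in the text

# Hypothetical data.
df = pd.DataFrame({"name": ["Alice", "Bob", "Cara"], "maths": [85, 91, 49]})

# apply() calls the function once per value in the column.
df["maths_pass"] = df["maths"].apply(lambda mark: mark >= PASS_MARK)
print(df)
```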

### Exercise: Assign Grades to Students

We would like to assign grades to students' Maths subject based on their marks.
* Create a Maths grade column `maths_grade`
* Maths >= 80: A
* Maths >= 70: B
* Maths >= 60: C
* Else: D
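One possible solution sketch for the exercise, using a helper function with `apply()`. The function name `to_grade` and the sample marks are hypothetical.

```python
import pandas as pd


def to_grade(mark):
    """Map a Maths mark to a grade using the thresholds listed above."""
    if mark >= 80:
        return "A"
    if mark >= 70:
        return "B"
    if mark >= 60:
        return "C"
    return "D"


# Hypothetical data covering each grade band.
df = pd.DataFrame({"name": ["Alice", "Bob", "Cara", "Dan"], "maths": [85, 72, 60, 49]})
df["maths_grade"] = df["maths"].apply(to_grade)
print(df)
```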

### Sorting

DataFrame supports the following 2 sorting functions.
* `sort_index()` 
* `sort_values()`

#### Sort by Row Index

Sort by row index in descending order.

#### Sort by Column Index

Sort by column index in descending order. 
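Both index sorts can be sketched with `sort_index()`, switching between rows and columns through the `axis` parameter. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data with an unordered row index.
df = pd.DataFrame(
    {"english": [72, 58, 66], "maths": [85, 91, 49]},
    index=[2, 0, 1],
)

# Rows ordered by index label, largest first.
by_row_index = df.sort_index(ascending=False)

# axis=1 orders the columns by their names instead.
by_column_index = df.sort_index(axis=1, ascending=False)

print(by_row_index)
print(by_column_index)
```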

#### Sort by Value(s)

Sorting by values can be done on multiple columns, each with its own sort order.

For example, sort the dataframe by `english` in descending order, then by `maths` in ascending order.
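The example above can be sketched with `sort_values()`, passing one `ascending` flag per column. The data is hypothetical and includes tied `english` marks so the second key matters.

```python
import pandas as pd

# Hypothetical data with ties in "english".
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara", "Dan"],
    "english": [72, 58, 72, 58],
    "maths": [85, 91, 49, 70],
})

# english descending first; ties are broken by maths ascending.
sorted_df = df.sort_values(by=["english", "maths"], ascending=[False, True])
print(sorted_df)
```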

## 3. Data Cleaning

* Missing Data
* Outliers
* Duplicates
* Type Conversion

Load tsv file `class1_test1_cleaning.tsv` in `data` folder.

### Missing Data

Examining the output of `info()` shows that not all columns have the same number of non-null values.

That indicates that some data is missing in the dataframe.
* Both `maths` and `science` columns have some missing data
* The `religion` column appears to have no data at all

The `isnull()` function returns `True` for each cell whose value is `NaN`.
To find out which columns contain `NaN` values, chain the `any()` function.

Does any student have missing (null) values in ALL of their subjects?
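A sketch of both checks: `isnull().any()` per column, and `isnull().all(axis=1)` per row restricted to the subject columns. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Cara has no marks at all.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara"],
    "english": [72.0, 58.0, np.nan],
    "maths": [85.0, np.nan, np.nan],
    "science": [64.0, 77.0, np.nan],
})
subjects = ["english", "maths", "science"]

# True for each column that contains at least one NaN.
cols_with_nan = df.isnull().any()
print(cols_with_nan)

# True for each row where every subject is NaN. The name column is excluded,
# because it is never null and would make all() False for every row.
rows_all_nan = df[subjects].isnull().all(axis=1)
print(df[rows_all_nan])
```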

### Handling Missing Data

#### Drop Column(s)

Drop `religion` column since it does not contain any data. 

#### Drop Row(s) with `NaN`

The `dropna()` function drops any row that contains null values.
* To drop any column with null value(s) instead, supply the parameter `axis=1`.
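Dropping an empty column and then dropping rows or columns with nulls can be sketched as follows, on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "religion" is entirely empty and Bob has no maths mark.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara"],
    "english": [72, 58, 66],
    "maths": [85.0, np.nan, 49.0],
    "religion": [np.nan, np.nan, np.nan],
})

# Remove the column that holds no data.
df = df.drop(columns="religion")

rows_kept = df.dropna()        # drops every row that has a null
cols_kept = df.dropna(axis=1)  # drops every column that has a null
print(rows_kept)
print(cols_kept)
```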

#### Replace `NaN` with a Value

You can replace `NaN` values with a given value using the `fillna()` function.
* Depending on the application, it is sometimes logical to replace a missing value with the mean, median or mode of that column.
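A short sketch of `fillna()` with a constant and with each column's mean, on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "maths": [80.0, np.nan, 60.0],
    "science": [70.0, 90.0, np.nan],
})

# Replace every NaN with a constant.
filled_zero = df.fillna(0)

# df.mean() is a Series of column means; fillna() matches it to columns by name.
filled_mean = df.fillna(df.mean())
print(filled_mean)
```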

#### Fill Null with Mean of Each Row

For exam results, it is more reasonable to use the mean of the student's other subjects (components) as the replacement.

Find the mean value of each row.
* For rows that are entirely null, replace with 0.

The `fillna()` function does not implement filling with per-row values along `axis=1`, so filling by row is not possible directly at the moment.
* An alternative solution is to transpose the dataframe before `fillna()` and transpose it back.
* Reference: https://stackoverflow.com/questions/33058590/pandas-dataframe-replacing-nan-with-row-average
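The transpose workaround can be sketched as below, assuming all subject columns are numeric. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical marks indexed by student; Cara has no marks at all.
df = pd.DataFrame(
    {
        "english": [72.0, 58.0, np.nan],
        "maths": [np.nan, 62.0, np.nan],
        "science": [64.0, np.nan, np.nan],
    },
    index=["Alice", "Bob", "Cara"],
)

# Mean of each row, ignoring NaN; an all-NaN row has a NaN mean, replaced by 0.
row_means = df.mean(axis=1).fillna(0)

# After transposing, each student is a column, so fillna() can match
# row_means to students by label. Transposing again restores the layout.
filled = df.T.fillna(row_means).T
print(filled)
```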

### Outliers

To detect outliers:
* Check the basic statistics of the dataframe.
* Use basic plotting to spot outlier records.

#### Scatter Plots

A scatter chart shows the relationship between two different variables.
* It can reveal the distribution trends. 
* It is used to highlight similarities in a data set. 
* It is useful for understanding the distribution of your data.
* It is commonly used to find outliers. 

#### Box Plots

A box plot is a visual representation of groups of numerical data through their quartiles.
* A box plot summarizes sample data using the 25th, 50th and 75th percentiles.
* It captures the summary of the data efficiently with a simple box and whiskers.
* It allows us to compare easily across groups.
* It is commonly used to detect outliers in a data set.

A box plot consists of 5 elements.
* Minimum
* First Quartile or 25%
* Median (Second Quartile) or 50%
* Third Quartile or 75%
* Maximum

By default, pandas box plots (drawn with matplotlib) extend the whiskers only to the most extreme points within 1.5 times the interquartile range of the box. Points beyond the whiskers are drawn individually as outliers.

Cap the values of outliers.
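The text does not say which cap to apply, so the sketch below assumes the common 1.5 × IQR rule and caps values with `Series.clip()`. The marks are hypothetical.

```python
import pandas as pd

# Hypothetical marks with one suspicious value.
marks = pd.Series([55, 60, 62, 65, 68, 70, 72, 75, 150], name="maths")

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the box is an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = marks[(marks < lower) | (marks > upper)]
print(outliers)

# clip() replaces values outside [lower, upper] with the nearest bound.
capped = marks.clip(lower=lower, upper=upper)
print(capped)
```

If the valid range is known in advance, for example marks between 0 and 100, those fixed bounds could be passed to `clip()` instead.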

### Duplicate Values

Reset the index of `df3` so that it uses a RangeIndex.

Duplicate the rows with index `[1, 3, 5]`.

Drop the duplicates from the dataframe in place.
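The three steps can be sketched as follows. The contents of `df3` are hypothetical, and appending the copied rows with `pd.concat()` is one possible way to create the duplicates.

```python
import pandas as pd

# Hypothetical df3 with a repeated index, as left by concatenation.
df3 = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Cara", "Dan", "Eve", "Finn"],
        "maths": [85, 91, 49, 70, 60, 74],
    },
    index=[0, 1, 2, 0, 1, 2],
)

# drop=True discards the old index instead of keeping it as a column.
df3 = df3.reset_index(drop=True)
has_range_index = isinstance(df3.index, pd.RangeIndex)

# Append copies of rows 1, 3 and 5 to create duplicates.
df3 = pd.concat([df3, df3.loc[[1, 3, 5]]], ignore_index=True)
duplicate_count = int(df3.duplicated().sum())
print(duplicate_count)

# Remove the repeated rows, keeping the first occurrence of each.
df3.drop_duplicates(inplace=True)
print(df3)
```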

### Type Conversion

Some marks columns contain null values. With the default NumPy-backed types, an integer column cannot hold `NaN`, so such columns are stored as `float` or `object`. Nullable types such as `Int64` are an exception.

For demonstration purposes, convert the `maths`, `english` and `science` columns to string format.

Use `DataFrame.astype()` to convert the `english` and `maths` columns to **float** type.

Use `pandas.to_numeric()` to convert the `science` column to a numeric type.
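A sketch of the round trip: marks are first turned into strings for demonstration, then converted back with `astype()` and `pandas.to_numeric()`. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with some missing values.
df = pd.DataFrame({
    "english": [72.0, np.nan, 66.0],
    "maths": [85.0, 91.0, np.nan],
    "science": [64.0, 77.0, 80.0],
})

# For demonstration only: store every mark as a string.
df = df.astype(str)

# astype() with a dict converts only the listed columns.
df = df.astype({"english": float, "maths": float})

# to_numeric() infers a suitable numeric type from the strings.
df["science"] = pd.to_numeric(df["science"])
print(df.dtypes)
```

`pd.to_numeric()` also accepts `errors="coerce"`, which turns values that cannot be parsed into `NaN` instead of raising an error.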