# Basic Data Manipulation with Pandas

The following are four common steps when starting a data analysis project.

* Data Exploration
* Data Filtering and Sorting
* Data Cleaning
* Data Transformation

In this exercise, we will learn how to use Pandas for each of these four steps.

Import `numpy` and `pandas`.
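A minimal sketch of the import step, using the conventional aliases `np` and `pd`. Printing the versions is an addition for illustration, since a few functions used later behave differently between pandas releases.

```python
import numpy as np
import pandas as pd

# Show the installed versions for reference.
print("numpy", np.__version__)
print("pandas", pd.__version__)
```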

## 1. Data Exploration

The loaded data may be too large to examine in full. We check the following aspects of the data to understand it better.
* Number of rows and columns
* Data types of columns
* View data samples
* Basic statistics of each column
* Basic plotting

Load the CSV file `temperature-monthly-mean-daily-maximum.csv` from the `data` folder.
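A possible way to load the file is `pd.read_csv()`. The sketch below reads a small in-memory stand-in so that it runs without the data folder. The column names `month` and `temp_mean_daily_max` and all values are assumptions made for illustration, not taken from the real file.

```python
import io
import pandas as pd

# In the exercise, the call would likely be:
#   df = pd.read_csv("data/temperature-monthly-mean-daily-maximum.csv")
# Stand-in text with made-up values and assumed column names.
csv_text = """month,temp_mean_daily_max
1982-01,29.8
1982-02,32.3
1982-03,32.3
1982-04,32.2
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```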

#### Size of Data

The `dataframe.shape` attribute returns the dimensions of the data as a `(rows, columns)` tuple.
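As a small illustration, the tuple can be unpacked into row and column counts. The dataframe here is a made-up stand-in.

```python
import pandas as pd

# Made-up sample: 3 rows, 2 columns.
df = pd.DataFrame({
    "month": ["1982-01", "1982-02", "1982-03"],
    "temp_mean_daily_max": [29.8, 32.3, 32.3],
})

rows, cols = df.shape   # shape is an attribute, not a function call
print(df.shape)
```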

#### Dataframe Info

The `dataframe.info()` function prints a summary of the dataframe.
* Each column's name, data type and non-null count, which shows whether the column contains any null data.
* Index type
* Memory usage
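A minimal sketch on a made-up dataframe with one missing value. `info()` prints its summary and returns `None`; `count()` is shown as a way to get the non-null counts as data.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing maths mark.
df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "maths": [48.0, np.nan, 74.0],
})

df.info()                 # prints column names, dtypes, non-null counts, memory usage

non_null = df.count()     # non-null count per column, as a Series
print(non_null)
```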

#### Sample Data

The `head()` and `tail()` functions return the first and last few rows of the data (5 rows by default).
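For illustration, on a made-up dataframe of 10 rows:

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})

first = df.head()     # first 5 rows by default
last3 = df.tail(3)    # last 3 rows
print(first)
print(last3)
```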

#### Statistical Information

The `describe()` function provides basic statistics such as count, mean, standard deviation, minimum, percentiles and maximum for a dataframe or a series of numeric values.

Pandas provides many statistical functions.
* The `median()` function returns the median of the values along the requested axis.
* The `mode()` function returns the most frequent value(s) along the requested axis.
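A short sketch of these functions on made-up marks. Note that `mode()` returns a DataFrame, because a column can have more than one most frequent value.

```python
import pandas as pd

# Made-up marks.
df = pd.DataFrame({
    "maths": [50, 60, 60, 90],
    "science": [40, 70, 80, 70],
})

summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max per column
medians = df.median()     # column-wise by default (axis=0)
modes = df.mode()         # one row per tied mode

print(summary)
print(medians)
print(modes)
```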

#### Rename columns

To make it easier for future exploration, we can rename some columns.
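One way to rename columns is `rename()` with a mapping from old to new names. Both the original name `temp_mean_daily_max` and the new name `max_temp` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"month": ["1982-01"], "temp_mean_daily_max": [29.8]})

# rename() returns a new dataframe; columns not in the mapping are unchanged.
df = df.rename(columns={"temp_mean_daily_max": "max_temp"})
print(df.columns.tolist())
```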

### Basic Plotting

#### Line Graph

#### Histogram

A histogram represents the frequency of occurrence of values within fixed intervals.

The parameter `bins` controls the granularity of the chart.
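To see what `bins` does without drawing anything, the sketch below counts made-up values with `np.histogram()`. This is a stand-in for the plot: `temps.plot.hist(bins=4)` would draw the same kind of counts as bars, provided matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Made-up temperature values.
temps = pd.Series([29.8, 30.1, 30.4, 31.0, 31.2, 31.5, 32.3, 32.3])

# Fewer bins give a coarser picture, more bins a finer one.
counts_2, edges_2 = np.histogram(temps, bins=2)
counts_4, edges_4 = np.histogram(temps, bins=4)

print(counts_2, edges_2)
print(counts_4, edges_4)
```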

## 2. Data Indexing, Filtering and Sorting

* Selecting row(s) and column(s) using index operator, `loc[]` and `iloc[]`
* Boolean filtering
* Assigning values with indexing
* Sorting

### Indexing and Data Selection

**Indexing** means selecting particular rows and columns of data from a DataFrame. 
* Indexing can be used to select an individual row, column or item.
* Indexing can also be used to perform **Subset Selection**.

Pandas provides the indexers `[]`, `.loc[]` and `.iloc[]`.
* `DataFrame[]`: Used for column selection. Also known as the indexing operator.
* `DataFrame.loc[]`: Used for selecting rows (and optionally columns) using **labels**.
* `DataFrame.iloc[]`: Used for selecting rows (and optionally columns) using **positions**.

### Select Columns using Indexing Operator `[]`

Select a single column using its label.

Since single column selection returns a Series, an additional `[]` can be added to select rows from the resulting Series.

To confirm that `dataframe[]` selects columns by labels, try it on a dataframe whose column label is an integer value.
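A sketch of column selection on a made-up marks table (names and marks are illustrative), followed by the integer-label check described above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "english": [70, 72, 49],
    "maths": [48, 68, 74],
})

maths = df["maths"]              # single label -> Series
second = df["maths"][1]          # column first, then row label 1 of that Series
subset = df[["name", "maths"]]   # list of labels -> DataFrame

# Columns labelled with integers: [] still looks up the column label.
df_int = pd.DataFrame([[1, 2], [3, 4]], columns=[10, 20])
col_10 = df_int[10]              # the column labelled 10, not a position
print(col_10)
```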

### Slicing Rows using Indexing Operator `[]`

The indexing operator can also be used to select multiple rows by slicing on their positions.

Select the first 2 rows.
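For illustration, on a made-up marks table:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "maths": [48, 68, 74],
})

first_two = df[0:2]   # a slice inside [] selects rows by position; the end is excluded
print(first_two)
```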

### Select Rows using `.loc[]`

This indexer selects rows and/or columns **by their labels**.
* `.loc[rows]` selects the given rows with all columns
* `.loc[rows, cols]` selects the given rows and columns
* `.loc[:, cols]` selects the given columns with all rows

Change the row index to letters.

Update the index (row labels) of the dataframe.

Select the 2nd row.

Select rows 0 and 2.

Select rows 0 and 2 in the `name` and `english` columns.

Select all rows in the `name` and `english` columns.
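A sketch of these selections on a made-up marks table. It assumes the letter index is `a`, `b`, `c`, so that the rows at positions 0 and 2 carry the labels `a` and `c`.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "english": [70, 72, 49],
    "maths": [48, 68, 74],
})
df.index = ["a", "b", "c"]                      # assumed letter labels

row_b = df.loc["b"]                             # 2nd row, returned as a Series
rows_ac = df.loc[["a", "c"]]                    # two rows, all columns
sub = df.loc[["a", "c"], ["name", "english"]]   # two rows, two columns
cols = df.loc[:, ["name", "english"]]           # all rows, two columns
print(sub)
```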

### Select Rows using `.iloc[]`

This indexer selects rows and/or columns **by their positions**.
* `.iloc[rows]` selects the given rows with all columns
* `.iloc[rows, cols]` selects the given rows and columns
* `.iloc[:, cols]` selects the given columns with all rows

Select the 2nd row, i.e. the row with name = `Adrian`.

Select multiple rows.

Select all rows from the columns `name` and `maths`.
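The same kind of selections by position, on a made-up marks table in which `name` and `maths` are assumed to sit at column positions 0 and 2.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "english": [70, 72, 49],
    "maths": [48, 68, 74],
}, index=["a", "b", "c"])

second = df.iloc[1]               # 2nd row regardless of its label
first_two = df.iloc[0:2]          # positions 0 and 1; the end is excluded
name_maths = df.iloc[:, [0, 2]]   # all rows, columns at positions 0 and 2
print(name_maths)
```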

### Filtering using Boolean Expression

When a Series is evaluated in a boolean expression, it returns a Series of boolean values.

Boolean values can be used to filter a dataframe.
* For example, to find out who has passed the maths test.
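A minimal sketch of the maths example. The pass mark of 50 and the marks themselves are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "maths": [48, 68, 74],
})

PASS_MARK = 50                          # assumed pass mark
passed_mask = df["maths"] >= PASS_MARK  # Series of True/False, one per row
passed = df[passed_mask]                # keeps only the rows where the mask is True
print(passed)
```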

**Exercise:**
* List all who have passed all 3 tests.

### Updating Values

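Assigning through `.loc[]` or `.iloc[]` updates the original dataframe directly. A selection stored on its own may be either a **view** or a copy of the original data, so changes made through it are not guaranteed to reach the original. Avoid chained assignment such as `df[mask]['maths'] = 50`.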
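A sketch of updating values in a single `.loc[]` step, which modifies the dataframe itself. Raising failing marks to 50 is a made-up rule for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "maths": [48, 68, 74],
})

# Row filter and column label in one .loc[] call, then assign.
df.loc[df["maths"] < 50, "maths"] = 50
print(df)
```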

### Sorting

* `sort_index()` 
* `sort_values()`
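A short sketch of the two functions on a made-up table whose rows are deliberately out of order. Both return a new dataframe and leave the original unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alby", "Aaron", "Adrian"], "maths": [74, 48, 68]},
    index=["c", "a", "b"],
)

by_row_label = df.sort_index()          # order rows by index label
by_col_label = df.sort_index(axis=1)    # order columns by column label
by_maths = df.sort_values(by="maths")   # order rows by the values in one column
print(by_maths)
```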

#### Sort by Row Index

#### Sort by Column Index

#### Sort by Value(s)

Sorting by values can also be done on multiple columns.
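For example, the sketch below sorts made-up marks by `english` in descending order and breaks ties by `maths` in ascending order.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alby", "Aaron", "Adrian"],
    "english": [49, 70, 70],
    "maths": [74, 48, 68],
})

# One sort direction per column, given in the same order as `by`.
ranked = df.sort_values(by=["english", "maths"], ascending=[False, True])
print(ranked)
```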

## 3. Data Cleaning

* Missing Data
* Outliers
* Duplicates
* Type Conversion

Load the TSV file `class1_test1_cleaning.tsv` from the `data` folder.

### Missing Data

By examining the output of `info()`, we can see that not all columns have the same number of non-null values.

That indicates that some data is missing in the dataframe.
* Both `maths` and `science` columns have some missing data
* The `religion` column appears to contain no data at all

The `isnull()` function returns `True` for each value that is `NaN`.
To find out which columns contain `NaN` values, chain it with the `any()` function.

The `religion` column is empty.
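A sketch of the check on a small tab-separated stand-in, read with `sep="\t"`. The rows are made up; only the column names `maths`, `science` and `religion` come from the text above.

```python
import io
import pandas as pd

# Made-up stand-in for the TSV file; empty fields become NaN.
tsv_text = (
    "name\tenglish\tmaths\tscience\treligion\n"
    "Aaron\t70\t48\t64\t\n"
    "Adrian\t72\t\t58\t\n"
    "Alby\t49\t74\t\t\n"
)
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

has_nan = df.isnull().any()      # True for columns with at least one NaN
nan_counts = df.isnull().sum()   # number of NaN values per column
all_nan = df.isnull().all()      # True for columns that are entirely NaN
print(nan_counts)
```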

### Handling Missing Data

#### Drop Column(s)

Drop `religion` column since it does not contain any data. 
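One way to do this is `drop()` with the `columns` parameter, shown here on made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "maths": [48.0, np.nan, 74.0],
    "religion": [np.nan, np.nan, np.nan],
})

# drop() returns a new dataframe without the listed columns.
df = df.drop(columns=["religion"])
print(df.columns.tolist())
```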

#### Drop Row(s) with `NaN`

The `dropna()` function drops any row that contains null values.
* To drop any column with null value(s), supply the parameter `axis=1`.
* To keep only the rows (or columns) that have at least n non-null values, supply the parameter `thresh=n`.
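A sketch of the three variants on made-up data. Note that `thresh=n` counts non-null values: a row is kept when it has at least n of them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Alby"],
    "maths": [48.0, np.nan, 74.0],
    "science": [64.0, np.nan, np.nan],
})

rows_complete = df.dropna()        # rows without any NaN
cols_complete = df.dropna(axis=1)  # columns without any NaN
at_least_2 = df.dropna(thresh=2)   # rows with at least 2 non-null values
print(at_least_2)
```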

#### Replace `NaN` with a Value

You can replace `NaN` values with a given value using the `fillna()` function.
* Depending on the application, it is sometimes logical to replace a missing value with the mean, median or mode of that column.
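For illustration, the sketch below fills a missing mark with the column mean. The marks are made up, and whether the mean is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"maths": [48.0, np.nan, 74.0]})

# mean() skips NaN by default, so the mean here is (48 + 74) / 2.
df["maths"] = df["maths"].fillna(df["maths"].mean())
print(df)
```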

#### Forward and Backward Filling

For missing values in measurements or time series, it is often logical to use the previous or next value as the replacement.

Use the `ffill()` function to perform forward-filling, or `bfill()` to perform backward-filling. Older code sets the parameter `method` of the `fillna()` function to `ffill` or `bfill` instead, which is deprecated in recent pandas versions.
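A small sketch on a made-up series, using the `ffill()` and `bfill()` methods. A leading or trailing `NaN` stays missing when there is no value to copy from.

```python
import numpy as np
import pandas as pd

s = pd.Series([30.1, np.nan, np.nan, 31.4, np.nan])

forward = s.ffill()    # copy the previous valid value forward
backward = s.bfill()   # copy the next valid value backward
print(forward.tolist())
print(backward.tolist())
```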

### Outliers

To detect outliers:
* Check the basic statistics of the dataframe.
* Use basic plotting to spot outlier records.

#### Scatter Plots

A scatter chart shows the relationship between two different variables.
* It can reveal trends in the distribution.
* It is used to highlight similarities in a data set. 
* It is useful for understanding the distribution of your data.
* It is commonly used to find outliers. 

#### Box Plots

A box plot is a visual representation of groups of numerical data through their quartiles.
* A box plot summarizes sample data using the 25th, 50th and 75th percentiles.
* It captures the summary of the data efficiently with a simple box and whiskers.
* It allows us to compare easily across groups.
* It is commonly used to detect outliers in a data set.

A box plot consists of 5 values.
* Minimum
* First Quartile or 25%
* Median (Second Quartile) or 50%
* Third Quartile or 75%
* Maximum

Cap the values of outliers.
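One possible way to cap outliers is `clip()`. The sketch below derives the bounds from the quartiles using the common 1.5 × IQR convention; that rule and the marks are assumptions, and the exercise may instead cap at a fixed bound.

```python
import pandas as pd

# Made-up marks with one obvious outlier.
marks = pd.Series([55, 60, 62, 65, 68, 70, 300])

q1 = marks.quantile(0.25)   # first quartile
q3 = marks.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

lower = q1 - 1.5 * iqr      # assumed lower bound
upper = q3 + 1.5 * iqr      # assumed upper bound

# Values outside [lower, upper] are replaced by the nearest bound.
capped = marks.clip(lower=lower, upper=upper)
print(capped.tolist())
```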

### Duplicate Values

Drop duplicates in dataframe directly.
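A sketch on made-up data. `duplicated()` marks repeated rows, and `drop_duplicates(inplace=True)` is assumed here to be what "directly" refers to, since it modifies the dataframe itself.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aaron", "Adrian", "Aaron"],
    "maths": [48, 68, 48],
})

dup_mask = df.duplicated()         # True for rows that repeat an earlier row
df.drop_duplicates(inplace=True)   # removes them from df itself; the first occurrence is kept
print(df)
```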

### Type Conversion

Some marks columns contain null values. With the default data types in Pandas, an integer column cannot hold `NaN`, so a column with missing values is stored as `float` or `object`. Thus to convert the marks columns to `int`, we first need to fix the missing data.
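A minimal sketch of the conversion. Filling missing marks with 0 is an assumption for illustration; any of the earlier strategies for missing data could be used instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"maths": [48.0, np.nan, 74.0]})
print(df["maths"].dtype)   # float64, because of the NaN

# Calling astype(int) while NaN is present raises an error,
# so the missing value is filled first (0 is an assumed replacement).
df["maths"] = df["maths"].fillna(0).astype(int)
print(df["maths"].dtype)
```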

### Update Index

## 4. Basic Data Transformation

* Maths Operations
* Function Applications

### Maths Operations with Scalar Value

You can apply a maths operation to every item in the dataframe.
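For illustration, on made-up numeric marks:

```python
import pandas as pd

df = pd.DataFrame({
    "english": [70, 72, 49],
    "maths": [48, 68, 74],
})

bonus = df + 5       # adds 5 to every cell
scaled = df / 100    # divides every cell by 100
print(bonus)
print(scaled)
```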

### Subtract Same Value from Columns

### Subtract Same Value from Rows

### Operations between DataFrames

### Function Applications

* `apply()` to apply a function to each column (the default, `axis=0`) or each row (`axis=1`)
* `applymap()` to apply a function to every cell (deprecated in recent pandas versions in favour of `DataFrame.map()`)
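A sketch of both kinds of function application on made-up marks. The pass mark of 50 is an assumption. For the cell-wise case the code uses `DataFrame.map()` when it is available and falls back to `applymap()` on older pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "english": [70, 72, 49],
    "maths": [48, 68, 74],
})

# axis=0 (default): the function receives one column at a time.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_total = df.apply(lambda row: row["english"] + row["maths"], axis=1)

# Cell-wise: the function receives one value at a time.
def to_grade(mark):
    return "pass" if mark >= 50 else "fail"   # assumed pass mark of 50

grades = df.map(to_grade) if hasattr(df, "map") else df.applymap(to_grade)
print(grades)
```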