# Instructions for Pandas Beginners

-- Math 210 Project 1

# Introduction:

**Pandas** is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. pandas is free software released under the three-clause BSD license. The name is derived from the term "Panel data", an econometrics term for multidimensional structured data sets.

## What can Pandas Do:


* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
* Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
* Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Intuitive merging and joining data sets
* Flexible reshaping and pivoting of data sets
* Hierarchical labeling of axes (possible to have multiple labels per tick)
* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.



## Pandas Data Type

**Pandas** is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of NumPy and has become one of the most popular tools in data science.
Pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.



## Learning outcome

For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. Pandas is the ideal tool for all of these tasks.

After working through this tutorial, you should be able to install Pandas on your device or use it in a Jupyter Notebook, create and summarize a DataFrame, and plot the data in a DataFrame.

The tutorial contains the following sections:
* 1. Installation and Setup (or import Pandas)
* 2. Create, Save and Load DataFrame
    * 2.1 Create Simple DataFrame
    * 2.2 Save and Load a DataFrame
    * 2.3 Use NumPy to Create DataFrame
* 3. DataFrames Statistics and Information
    * 3.1 Descriptive statistics
    * 3.2 Data Information
    * 3.3 Name of columns
* 4. Graphs Plotting
    * 4.1 Histogram
    * 4.2 Line Graph
        * 4.2.2 Plot a subset of DataFrame
    * 4.3 Other Graphs and Graph Styling
        * 4.3.1 Pie Chart
        * 4.3.2 Bar Chart
        * 4.3.3 Graph Styling

# TUTORIAL

## 1. Installation and Setup (or import Pandas)

The first step is to install pandas via Anaconda or Miniconda.
Here are the [instructions for setup](http://stackoverflow.com/documentation/pandas/796/getting-started-with-pandas#t=201703210004262451923).

Once pandas is installed, we import it into our Python 3 notebook under the conventional alias `pd`.

In [None]:
import pandas as pd
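
A quick way to confirm that the installation worked is to print the installed version. This is a minimal sketch, assuming pandas is already installed in the active environment:

```python
import pandas as pd

# __version__ is a string; its exact value depends on your installation
print(pd.__version__)
```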

## 2. Create, Save and Load DataFrame

### 2.1 Create Simple DataFrame

We can use `pd.DataFrame` to create a sample DataFrame containing several columns.

**Following the format:**


`*Name of new frame* = 
pd.DataFrame({
   *Name of column1*: [value(as a series of data)], 
   *Name of column2*: [value(as a series of data)], 
   *Name of column3*: [value(as a series of data)], 
      ...}) `
              
for example:
              

In [None]:
dt_frame1= pd.DataFrame({'Student': [1,2,3,4], 'faculty': ['Arts', 'Science', 'Law','Music']})

In [None]:
dt_frame1

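In older versions of Pandas, columns created from a dictionary were ordered alphabetically, while current versions keep the order in which the dictionary lists them. Either way, we can use the `columns` parameter to specify the order explicitly.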

In [None]:
dt_frame2= pd.DataFrame({'Student': [1,2,3,4], 'faculty': ['Arts', 'Science', 'Law','Music']},columns=['faculty','Student'])

In [None]:
dt_frame2
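
The column order of an existing DataFrame can also be changed after creation by selecting the columns in the desired order. The following is a small sketch; `students` and `reordered` are names made up for this illustration:

```python
import pandas as pd

students = pd.DataFrame({'Student': [1, 2, 3, 4],
                         'faculty': ['Arts', 'Science', 'Law', 'Music']})

# Selecting with a list of column names returns a new DataFrame in that order
reordered = students[['faculty', 'Student']]
print(list(reordered.columns))
```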

### 2.2 Save and Load a DataFrame

We can save or load a DataFrame in pickle (.pkl) format using the `to_pickle` and `read_pickle` commands.

**Following the format:**

`# Save a DataFrame to a pickled pandas object:
df.to_pickle(file_name)  # where to save it, usually as a .pkl file`

`# Load a DataFrame from a pickled pandas object:
df = pd.read_pickle(file_name)`
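
As a concrete illustration, the sketch below saves a small DataFrame and loads it back. The file name `students.pkl` and the use of a temporary directory are assumptions made for this example only, so that no files are left behind:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Student': [1, 2, 3, 4],
                   'faculty': ['Arts', 'Science', 'Law', 'Music']})

# 'students.pkl' is a hypothetical file name inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'students.pkl')
    df.to_pickle(file_name)                # save
    df_loaded = pd.read_pickle(file_name)  # load

# equals() checks that values, columns and dtypes all match
print(df_loaded.equals(df))
```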

### 2.3 Use NumPy to Create DataFrame

We can generate random numbers using NumPy. When we create a DataFrame of random numbers, we should set the seed (here 0) to get a reproducible sample.

In [None]:
import numpy as np

In [None]:
np.random.seed(0)

In [None]:
df1 = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
print(df1)

Or we can use `np.arange` to create a sample of integers, with `reshape` to arrange them into rows and columns.


In [None]:
df2 = pd.DataFrame(np.arange(18).reshape(3,6),columns=list('ABCDEF'))

print(df2)

## 3. DataFrames Statistics and Information

### 3.1 Descriptive statistics

We can easily get a statistical summary (including mean, standard deviation, number of observations, minimum, maximum, and quartiles) of numerical columns with the `.describe()` command.

**The summary includes:**

* count: number of observations
* mean: arithmetic mean
* std: standard deviation
* min: minimum (smallest observation)
* 25%: lower quartile or first quartile
* 50%: median (middle value)
* 75%: upper quartile or third quartile
* max: maximum (largest observation)

In [None]:
df1

In [None]:
df1.describe()

By default, when a DataFrame mixes numerical and non-numerical columns, the output contains only the numerical columns.

In [None]:
dt_frame2

In [None]:
dt_frame2.describe()
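
Non-numerical columns can still be summarized when asked for explicitly. The sketch below shows two ways that should work: describing only the text column, and passing `include='all'`. The names `students`, `text_summary` and `full_summary` are illustrative only.

```python
import pandas as pd

students = pd.DataFrame({'Student': [1, 2, 3, 4],
                         'faculty': ['Arts', 'Science', 'Law', 'Music']})

# A frame with only text columns is described by count, unique, top and freq
text_summary = students[['faculty']].describe()
print(text_summary)

# include='all' summarizes every column; statistics that do not apply show NaN
full_summary = students.describe(include='all')
print(full_summary)
```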

### 3.2 Data Information

Using the `.info()` command, we can easily get basic information about each column of a DataFrame, along with its memory usage.

In [None]:
df1

In [None]:
df1.info()

In [None]:

df3 = pd.DataFrame({'integers': [1, 2, 3], 
                   'floats': [2.4, 3.8, 4.9], 
                   'text': ['How', 'Are', 'You'], 
                   'ints with None': [4, None, 9]})
print(df3)

In [None]:
df3.info()
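
The data types reported by `info()` can also be inspected directly through the `dtypes` attribute. The sketch below, with the illustrative name `mixed`, suggests why a column of integers containing `None` is reported as `float64`: the missing entry is stored as NaN, which is a floating point value.

```python
import pandas as pd

mixed = pd.DataFrame({'integers': [1, 2, 3],
                      'ints with None': [4, None, 9]})

# dtypes returns the data type of each column
print(mixed.dtypes)

# isna() marks the missing entries; sum() counts them per column
print(mixed.isna().sum())
```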

### 3.3 Name of columns

Using the `list` function or the `columns` attribute, we can easily list the names of all columns.

`list(Name of DataFrame)`

In [None]:
list(dt_frame1)

`df.columns`

In [None]:
dt_frame1.columns
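
Note that `df.columns` returns an Index object rather than a plain list. When a list is needed, it can be converted, as in this short sketch (`students` is an illustrative name):

```python
import pandas as pd

students = pd.DataFrame({'Student': [1, 2, 3, 4],
                         'faculty': ['Arts', 'Science', 'Law', 'Music']})

names_from_list = list(students)              # iterating a DataFrame yields its column names
names_from_index = students.columns.tolist()  # convert the Index to a plain list

print(names_from_list == names_from_index)
print('faculty' in students.columns)          # membership test on the Index
```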

## 4. Graphs Plotting

In Pandas, we can create DataFrames and handle them easily. To draw graphs of the data, Pandas relies on `matplotlib`.


In [None]:
import matplotlib.pyplot as plt

*Pandas can plot several types of graphs, including:*

 * histogram
 * line graph
 * pie chart
 * bar chart

### 4.1 Histogram

A histogram can be drawn from a pandas Series (created with `pd.Series`). The x-axis represents the bins of the data, while the y-axis represents the frequency of each bin.

**Using the format:** *series.plot(kind='hist')*

Example: Let's generate a random series using NumPy and plot the histogram of the series.

In [None]:
np.random.seed(0)  # set the seed to zero for reproducibility

numbers = np.random.randn(500)  # array of 500 normally distributed numbers
series_1 = pd.Series(numbers)   # convert the array to a pandas Series
series_1.plot(kind='hist')
plt.show()

We can also add a graph title using the `title` argument:


In [None]:
series_1.plot(kind='hist',title='Normal graph')
plt.show()
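
To see the numbers behind the picture, the bin edges and the frequency of each bin can be computed directly with `np.histogram`. This is only a sketch for illustration; it uses 10 bins, which is also the documented default of the pandas histogram plot.

```python
import numpy as np

np.random.seed(0)
numbers = np.random.randn(500)

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(numbers, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:6.2f} to {right:6.2f}: {count}')
```

Each printed row corresponds to one bar of the histogram: the range is the bar's position on the x-axis and the count is its height.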

### 4.2 Line Graph
 
A DataFrame that contains only numerical values can be plotted as a line graph using the `.plot()` method. Each column becomes one line.

In [None]:
df1

In [None]:
df1.plot()
plt.show()

### 4.2.2 Plot a subset of DataFrame

We can plot a subset (a single column) of a DataFrame as follows:

`df['column'].plot()`

In [None]:
df1['C'].plot()
plt.show()
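
Several columns can be plotted together by selecting them with a list of names. This is a minimal sketch, assuming matplotlib is installed; the `Agg` backend line is only there so the example also runs without a display, and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

# Selecting with a list of names keeps a DataFrame, so each column becomes one line
ax = df1[['A', 'C']].plot(title='Columns A and C')
plt.show()
```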

### 4.3 Other Graphs and Graph Styling

* Pie Chart
* Bar Chart
* Graph Styling

Create a new DataFrame named `df3` for the examples (this replaces the `df3` used in section 3.2):

In [None]:
df3 = pd.DataFrame({'Group A': [8,5,8,3,5,7,2,5,8,9,3,1,2],'Group B': [4,7,3,4,6,4,2,4,7,2,3,1,7]})

In [None]:
df3

### 4.3.1 Pie Chart

A pie chart represents the proportion that each section contributes to the whole.

A pie chart can only be plotted for one single subset (column), following the format:

`df['column'].plot(kind='pie')`

In [None]:
df3['Group A'].plot(kind='pie')
plt.show()

Trick: depending on the matplotlib version, a pie chart may be plotted as an ellipse. To turn it into a circle, we can call `pyplot.axis('equal')` from pyplot.

In [None]:
from matplotlib import pyplot

In [None]:
pyplot.axis('equal')
df3['Group A'].plot(kind='pie')
plt.show()

### 4.3.2 Bar Chart

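Bars illustrate the difference in distribution between two populations (columns). The example here uses `df.hist()`, which draws one histogram per column; a bar chart of the raw values is available through `df.plot(kind='bar')`. The format is:

`df.hist()`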

In [None]:
df3.hist()
plt.show()

We can also draw the histogram of a single column, which helps us inspect one group at a time, following:

`df['column to apply'].hist()`

In [None]:
df3['Group A'].hist()
plt.show()
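
For comparison, a bar chart in the usual sense draws one bar per category rather than per bin. The sketch below, which assumes matplotlib is installed, counts how often each value occurs in `Group A` with `value_counts()` and draws those counts with `kind='bar'`. The name `counts_a` is illustrative, and the `Agg` backend line can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df3 = pd.DataFrame({'Group A': [8, 5, 8, 3, 5, 7, 2, 5, 8, 9, 3, 1, 2],
                    'Group B': [4, 7, 3, 4, 6, 4, 2, 4, 7, 2, 3, 1, 7]})

# value_counts() tallies each distinct value; sort_index() orders the bars by value
counts_a = df3['Group A'].value_counts().sort_index()
ax = counts_a.plot(kind='bar', title='Group A value counts')
plt.show()
```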

### 4.3.3 Graph Styling

By passing the `style` argument to `df.plot()`, we can customize the appearance of the graph.

*Examples are shown below:*

* Draw the data of a line plot as individual points instead of lines

In [None]:
df1.plot(style='o')
plt.show()

* Make the lines of a line plot blue

In [None]:
df1.plot(style='b')
plt.show()

* Make the lines of a line plot red and dashed

In [None]:
df1.plot(style='r--')
plt.show()
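
The `style` argument also accepts a list or a dictionary, so each column can get its own style. The following is a hedged sketch, assuming matplotlib is installed; the particular styles chosen are arbitrary, and the `Agg` backend line can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

# Keys are column names, values are matplotlib format strings:
# 'r--' red dashed, 'b' blue solid, 'g:' green dotted, 'ko' black points
ax = df1.plot(style={'A': 'r--', 'B': 'b', 'C': 'g:', 'D': 'ko'})
plt.show()
```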

# Example Sources and References

All examples in this tutorial come from or are motivated by:

* Websites:
https://github.com/vinta/awesome-python

http://stackoverflow.com/documentation/pandas/796/getting-started-with-pandas#t=201703210004262451923

http://pandas.pydata.org/pandas-docs/stable/

* YouTube video:
https://www.youtube.com/watch?v=Iqjy9UqKKuo

* PDF: http://pandas.pydata.org/pandas-docs/version/0.9.1/pandas.pdf