### Acknowledgement
This notebook contains material from the following resources:
1. https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html
2. https://www.w3schools.com/python/pandas/default.asp

### Introduction
Pandas is a Python library used for data analysis and manipulation. It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" is a reference to "Python Data Analysis". The library was created by Wes McKinney in 2008.

To use Pandas, first import the ``pandas`` package under the alias ``pd``.

**alias:** In Python, an alias is an alternate name for referring to the same thing.

Create an alias with the ``as`` keyword while importing:



In [2]:
import pandas as pd

### Checking the version of Pandas
The version string is stored in the ``__version__`` attribute.

In [3]:
pd.__version__

'0.23.4'

### Fundamental Data Structures in Pandas

The two fundamental Pandas data structures are: 
- Series
- DataFrame


### The Pandas Series Object

A Pandas Series is a **one-dimensional array** of indexed data. It can be created from a list or array as follows:



In [9]:
a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)


0    1
1    7
2    2
dtype: int64


As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes. 



In [13]:
myvar.values

array([1, 7, 2], dtype=int64)

In [14]:
myvar.index

RangeIndex(start=0, stop=3, step=1)
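As a small sketch of what these two attributes hold, the snippet below rebuilds the same Series and inspects both parts. The names `values` and `labels` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

myvar = pd.Series([1, 7, 2])

values = myvar.values        # the underlying data as a NumPy array
labels = list(myvar.index)   # the default integer labels as a plain list

print(type(values))
print(labels)
```

The values behave like an ordinary NumPy array, while the index is a separate object that carries the labels.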

### Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [15]:
myvar[0]

1

### Create Labels
With the ``index`` argument, you can name your own labels.

In [16]:
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


In [17]:
myvar["x"]

1
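Once custom labels are in place, a value can be reached either by its label or by its position. A minimal sketch, assuming the same three-element Series as above (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]      # look up by label
by_position = myvar.iloc[1]    # look up by integer position (0-based)

print(by_label, by_position)
```

Using ``loc`` for labels and ``iloc`` for positions keeps the two kinds of lookup explicit.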

### Creating Series from Dictionary


In [34]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
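When a dictionary is passed, its keys become the index labels. As a hedged sketch, the ``index`` argument can also be combined with a dictionary to keep only some keys, in a chosen order (the name `subset` is illustrative):

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# Only the listed keys are kept, in the order given
subset = pd.Series(population_dict, index=['Texas', 'California'])

print(subset)
```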


Unlike a dictionary, though, the Series also supports array-style operations such as slicing. The same slice applied to the plain dictionary raises an error:

In [26]:
population['California': 'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

In [27]:
population_dict['California': 'Florida']

TypeError: unhashable type: 'slice'
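One detail of the Series slice above is worth spelling out: a slice by label includes its end label, while a slice by position excludes its end position. A minimal sketch, assuming the same five states (built here from explicit lists so the order is fixed; `by_label` and `by_position` are illustrative names):

```python
import pandas as pd

population = pd.Series(
    [38332521, 26448193, 19651127, 19552860, 12882135],
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'],
)

by_label = population['California':'Florida']   # label slice: 'Florida' is included
by_position = population.iloc[0:3]              # positional slice: position 3 is excluded

print(len(by_label), len(by_position))
```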

## DataFrames
Data sets in Pandas are usually two-dimensional tables, called DataFrames.

A Series is like a single column; a DataFrame is the whole table.

### Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

#### From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:



In [39]:
pd.DataFrame({'population': population})

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
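As an alternative sketch of the same single-column construction, a Series can also be converted with its ``to_frame()`` method. The two-state Series here is a shortened stand-in for the one above, and `from_dict` and `from_method` are illustrative names.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

from_dict = pd.DataFrame({'population': population})   # dictionary form, as above
from_method = population.to_frame(name='population')   # equivalent method form

same = from_dict.equals(from_method)
print(same)
```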


#### From a dictionary of Series objects

A DataFrame can be constructed from a dictionary of Series objects as well:

In [36]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

In [37]:
states = pd.DataFrame({'population':population,'area':area})

In [38]:
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
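The two Series above share the same labels, so every cell is filled. As a hedged sketch of what happens otherwise, the deliberately mismatched Series below (an assumption for illustration, not data from the notebook) show that rows are matched by label and missing combinations become ``NaN``:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Rows are aligned on the index labels, not on position
partial = pd.DataFrame({'population': population, 'area': area})

print(partial)
```

Only Texas appears in both Series, so it is the only row with both columns filled.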


#### Retrieve Columns


In [46]:
states.columns

Index(['population', 'area'], dtype='object')

#### Retrieve Row Indices

In [47]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the ``loc`` attribute to return one or more specified rows.

In [49]:
states.loc['California']

population    38332521
area            423967
Name: California, dtype: int64

In [51]:
states.loc[['California','Florida']]

            population    area
California    38332521  423967
Florida       19552860  170312
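Beyond whole rows, ``loc`` also accepts a row label and a column label together, and ``iloc`` selects by position instead of label. A minimal sketch on a two-state version of the table (the names `texas_area` and `first_row` are illustrative):

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

texas_area = states.loc['Texas', 'area']   # single cell: row label, column label
first_row = states.iloc[0]                 # first row by integer position

print(texas_area)
print(first_row)
```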


### Load Files Into a DataFrame


In [52]:
iris = pd.read_csv('dataset/iris.csv')

In [56]:
iris


   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
5           5.4          3.9           1.7          0.4  Setosa
6           4.6          3.4           1.4          0.3  Setosa
7           5.0          3.4           1.5          0.2  Setosa
8           4.4          2.9           1.4          0.2  Setosa
9           4.9          3.1           1.5          0.1  Setosa
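The ``read_csv`` call above depends on a local file. As a self-contained sketch, the same function can read from an in-memory text buffer; the CSV text below simply copies the first three rows shown above, and `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
```

The first line of the text is used as the column headers, and a default integer index is assigned to the rows.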


### Viewing the Data
One of the most used methods for getting a quick overview of a DataFrame is the ``head()`` method.

The ``head()`` method returns the headers and a specified number of rows (five by default), starting from the top.

In [57]:
iris.head()

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa


There is also a ``tail()`` method for viewing the last rows of the DataFrame.

The ``tail()`` method returns the headers and a specified number of rows, starting from the bottom.



In [58]:
iris.tail()

     sepal.length  sepal.width  petal.length  petal.width    variety
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica
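Both methods take the number of rows as an optional argument. A minimal sketch on a small made-up DataFrame (the column `n` and the names `top` and `bottom` are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'n': list(range(10))})

top = df.head(3)      # first three rows
bottom = df.tail(2)   # last two rows

print(top)
print(bottom)
```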


### Info About the Data
The DataFrame object has a method called ``info()`` that gives you more information about the data set.



In [59]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal.length    150 non-null float64
sepal.width     150 non-null float64
petal.length    150 non-null float64
petal.width     150 non-null float64
variety         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
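``info()`` prints its report rather than returning it. When the individual pieces are needed as values, a sketch like the following can be used; the three-row DataFrame is a small stand-in for the iris data, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'sepal.length': [5.1, 4.9, 4.7],
    'variety': ['Setosa', 'Setosa', 'Setosa'],
})

n_rows, n_cols = df.shape                 # number of rows and columns
length_dtype = df['sepal.length'].dtype   # data type of one column
missing = df.isna().sum()                 # count of missing values per column

print(n_rows, n_cols)
print(length_dtype)
print(missing)
```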


In [60]:
iris.describe()

       sepal.length  sepal.width  petal.length  petal.width
count         150.0        150.0         150.0        150.0
mean       5.843333     3.057333         3.758     1.199333
std        0.828066     0.435866      1.765298     0.762238
min             4.3          2.0           1.0          0.1
25%             5.1          2.8           1.6          0.3
50%             5.8          3.0          4.35          1.3
75%             6.4          3.3           5.1          1.8
max             7.9          4.4           6.9          2.5
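The result of ``describe()`` is itself a DataFrame, so single statistics can be picked out with ``loc``. A minimal sketch on made-up numbers (the column `x` and the name `summary` are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()   # summary statistics for the numeric columns

print(summary.loc['mean', 'x'])   # average of the column
print(summary.loc['50%', 'x'])    # median of the column
```

Note that only numeric columns appear by default, which is why ``variety`` is absent from the output above.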


### Check Distribution of Values in String Column

In [61]:
iris['variety'].value_counts()

Versicolor    50
Setosa        50
Virginica     50
Name: variety, dtype: int64
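``value_counts()`` can also report shares instead of raw counts through its ``normalize`` argument. A hedged sketch on a tiny made-up column (the names `variety`, `counts`, and `shares` are illustrative):

```python
import pandas as pd

variety = pd.Series(['Setosa', 'Setosa', 'Virginica'])

counts = variety.value_counts()                 # number of rows per value
shares = variety.value_counts(normalize=True)   # fraction of rows per value

print(counts)
print(shares)
```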

### Create New Column

In [62]:
iris['sepal'] = iris['sepal.length']+iris['sepal.width']

In [63]:
iris

   sepal.length  sepal.width  petal.length  petal.width variety  sepal
0           5.1          3.5           1.4          0.2  Setosa    8.6
1           4.9          3.0           1.4          0.2  Setosa    7.9
2           4.7          3.2           1.3          0.2  Setosa    7.9
3           4.6          3.1           1.5          0.2  Setosa    7.7
4           5.0          3.6           1.4          0.2  Setosa    8.6
5           5.4          3.9           1.7          0.4  Setosa    9.3
6           4.6          3.4           1.4          0.3  Setosa    8.0
7           5.0          3.4           1.5          0.2  Setosa    8.4
8           4.4          2.9           1.4          0.2  Setosa    7.3
9           4.9          3.1           1.5          0.1  Setosa    8.0
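The assignment above changes ``iris`` in place. As an alternative sketch, ``assign()`` returns a new DataFrame with the extra column and leaves the original untouched; the two-row DataFrame and the name `with_sum` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'sepal.length': [5.1, 4.9], 'sepal.width': [3.5, 3.0]})

# Build the new column from an element-wise sum of two existing columns
with_sum = df.assign(sepal=df['sepal.length'] + df['sepal.width'])

print(with_sum)
print(list(df.columns))   # the original DataFrame has no 'sepal' column
```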


### Delete a Column

In [64]:
del iris['sepal']

In [65]:
iris

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
5           5.4          3.9           1.7          0.4  Setosa
6           4.6          3.4           1.4          0.3  Setosa
7           5.0          3.4           1.5          0.2  Setosa
8           4.4          2.9           1.4          0.2  Setosa
9           4.9          3.1           1.5          0.1  Setosa
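``del`` removes the column from the DataFrame in place. A non-destructive alternative, sketched here on a small made-up DataFrame, is ``drop()``, which returns a copy without the named columns (the name `trimmed` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'sepal.length': [5.1, 4.9], 'sepal': [8.6, 7.9]})

trimmed = df.drop(columns=['sepal'])   # new DataFrame without the 'sepal' column

print(list(trimmed.columns))
print(list(df.columns))                # the original still has both columns
```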
