## What is pandas?
Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

### Pandas resources: 
* <b>Pandas documentation:</b> https://pandas.pydata.org/pandas-docs/stable/
* <b>10 minutes to pandas:</b> https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min
* <b>A handy Pandas cheat sheet:</b> https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
* <b>Pandas cookbook (repository of short and sweet examples):</b> https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook
* <b>Pandas exercises and solutions:</b> https://github.com/guipsamora/pandas_exercises

## Why use Pandas? 
For example, say you want to explore a dataset stored in a CSV file on your computer. You can use pandas to load the data from the CSV into a DataFrame, which lets you do things like: 
* Calculate statistics and answer questions about the data
    - What's the average, median, max, or min of each column?
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?
* Clean the data by removing missing values, filtering data based on given criteria, etc.
* Visualize the data using Matplotlib
* Store the cleaned, transformed data back into a CSV file

Before we can do machine learning, we need a good understanding of the dataset. Pandas is a tool that helps us achieve this.
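
As a rough sketch of the workflow listed above, the example below loads a small CSV, computes a few statistics, cleans the data, and writes it back out. The CSV content, the column names, and the `age > 29` filter are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (made-up values).
csv_text = """height,weight,age
170,65,30
182,80,45
165,,28
175,72,38
"""

df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics for each column
print(df.mean())
print(df.median())

# Does height correlate with weight?
print(df["height"].corr(df["weight"]))

# Clean the data: drop rows with missing values, then keep people older than 29
cleaned = df.dropna()
cleaned = cleaned[cleaned["age"] > 29]

# Store the cleaned data as CSV text (pass a file name to write to disk instead)
csv_out = cleaned.to_csv(index=False)
print(csv_out)
```

With a real dataset you would pass the file name to `pd.read_csv()` and `to_csv()` instead of using in-memory text.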


### Pandas data types
1. Series
2. DataFrame

### Series
A pandas Series is similar to a one-dimensional array and can store data of any type. By default, the first element in the series is assigned index 0. 
```Python
import numpy as np

# create one-dimensional numpy array
a = np.array([5, 8, 12])
```

To create a series we call the pd.Series() constructor and pass it a list of values: 
```Python
import pandas as pd

my_series = pd.Series([10,2,13,4])
print(my_series)

Output: 
0    10
1     2
2    13
3     4
dtype: int64
```

The first column is the index, and the second column contains the elements we added to the series. The values of a pandas Series are mutable, but its length is normally treated as fixed: most operations that add or remove elements return a new Series instead of resizing the original.
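
A minimal sketch of the index and value behaviour described above. The variable names `s` and `labelled` and the string labels are chosen here only for illustration.

```python
import pandas as pd

s = pd.Series([10, 2, 13, 4])

# The index and the values are stored separately
print(list(s.index))   # the default integer labels
print(s.to_numpy())    # the underlying values as a NumPy array

# Values can be changed in place; the length stays the same
s[1] = 20
print(s)

# The index does not have to be integers: labels can be supplied explicitly
labelled = pd.Series([10, 2, 13, 4], index=["a", "b", "c", "d"])
print(labelled["c"])
```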

Series may also be created from a numpy array:
```Python
import numpy as np
import pandas as pd

# create one-dimensional numpy array
a = np.array(['cat', 'dog', 'horse'])

# create series from numpy array
np_series = pd.Series(a)
print(np_series)

Output:
0      cat
1      dog
2    horse
dtype: object
```

Pandas series documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

### DataFrame
The pandas DataFrame is the primary pandas data structure and can be seen as a table. It organizes data into rows and columns, making it a two-dimensional data structure. The columns can contain different types, and the size of the DataFrame is mutable: rows and columns can be added or removed. It can be thought of as a dict-like container for Series objects. 
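
To illustrate the "dict-like container for Series" idea, here is a small sketch that builds a DataFrame from two Series and pulls one back out. The names `ages`, `cities`, and `people`, and the data in them, are hypothetical.

```python
import pandas as pd

# Two Series that share the same index (hypothetical example data)
ages = pd.Series([24, 13, 53], index=["John", "Anna", "Peter"])
cities = pd.Series(["New York", "Paris", "Berlin"], index=["John", "Anna", "Peter"])

# Each dictionary key becomes a column name
people = pd.DataFrame({"Age": ages, "City": cities})
print(people)

# Selecting one column gives back a Series
print(type(people["Age"]))

# Columns can have different dtypes
print(people.dtypes)

# The size is mutable: a new column can be added in place
people["Adult"] = people["Age"] >= 18
print(people.shape)
```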

<img src="https://media.geeksforgeeks.org/wp-content/uploads/dealing_with_rows_columns.png" width="400" height="400">

To create a DataFrame you can either start from scratch or convert other data structures, such as NumPy arrays, into a DataFrame.
```Python
import pandas as pd

# create a DataFrame called df: Column1 holds integers, Column2 strings,
# Column3 floating-point values, and Column4 booleans
df = pd.DataFrame({
    "Column1": [1, 4, 8, 7, 9],
    "Column2": ['a', 'column', 'with', 'a', 'string'],
    "Column3": [1.23, 23.5, 45.6, 32.1234, 89.453],
    "Column4": [True, False, True, False, True]
})
print(df)

Output: 
   Column1 Column2  Column3  Column4
0        1       a   1.2300     True
1        4  column  23.5000    False
2        8    with  45.6000     True
3        7       a  32.1234    False
4        9  string  89.4530     True
```
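
The block above starts from a dictionary. Converting a NumPy array works in much the same way. The sketch below assumes a small 3x2 integer array and the made-up column names `a` and `b`.

```python
import numpy as np
import pandas as pd

# A 3x2 NumPy array of integers
arr = np.array([[1, 2], [3, 4], [5, 6]])

# Column names are optional; without them pandas uses 0, 1, ...
df_from_array = pd.DataFrame(arr, columns=["a", "b"])
print(df_from_array)

# Going back to a NumPy array
print(df_from_array.to_numpy())
```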

We can also create DataFrames from lists: 
```Python
import pandas as pd
mylist = [4, 8, 12, 16, 20]
df = pd.DataFrame(mylist)
print(df)

Output: 
    0
0   4
1   8
2  12
3  16
4  20

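# a list of lists: each inner list becomes one row
items = [['Phone', 2000], ['TV', 1500], ['Radio', 800]]
df = pd.DataFrame(items, columns=['Item', 'Price'])
# convert only the numeric column to float; passing dtype=float to the constructor
# would also try to convert the 'Item' strings, which fails in recent pandas versions
df['Price'] = df['Price'].astype(float)
print(df)

Output: 
    Item   Price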
0  Phone  2000.0
1     TV  1500.0
2  Radio   800.0
```

DataFrame documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame

### Creating a DataFrame using a dictionary


In [3]:
import pandas as pd
from IPython.display import display

# create a simple dataset of people
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
       'Location': ['New York', 'Paris', 'Berlin', 'London'],
       'Age': [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)

# IPython.display allows "pretty printing" of dataframes
display(data_pandas)

,Name,Location,Age
0,John,New York,24
1,Anna,Paris,13
2,Peter,Berlin,53
3,Linda,London,33


## Zip lists to build a DataFrame

In [4]:
list_keys = ['Country', 'Population (million)']
list_values = [['United States', 'Russia', 'United Kingdom'], [331, 144, 66]]

# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys,list_values))

# Build a dictionary with the zipped list
data = dict(zipped)
print(data)

# Build and inspect a DataFrame 
df = pd.DataFrame(data)
df

{'Country': ['United States', 'Russia', 'United Kingdom'], 'Population (million)': [331, 144, 66]}


,Country,Population (million)
0,United States,331
1,Russia,144
2,United Kingdom,66


# Getting started with Pandas

In [5]:
# We import pandas like this: 
import pandas as pd

## Loading and saving data

In [None]:
# reading from a csv file to a DataFrame (this requires that the file foo.csv is located in the same folder as the .ipynb)
df = pd.read_csv('foo.csv')

# if dataset is inside a folder named datasets
df = pd.read_csv('datasets/foo.csv')

In [None]:
# writing to a csv file from a DataFrame
df.to_csv('name_of_csvfile.csv')

# read more about to_csv() parameters here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
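
One detail worth knowing when saving and reloading: by default `to_csv()` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows this round trip with a made-up two-row DataFrame and a temporary file; the file name `items.csv` is arbitrary.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Item": ["Phone", "TV"], "Price": [2000.0, 1500.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "items.csv")

    # Default: the index is written as an extra, unnamed first column
    df.to_csv(path)
    with_index = pd.read_csv(path)
    print(with_index.columns.tolist())

    # index=False leaves the index out of the file
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)
    print(without_index.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv()` is another option if you want to keep the saved index and use it as the index again.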

In [7]:
# loading dataset from github to a DataFrame
data = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv')
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


# Preview data from DataFrames

In [8]:
# Writing the variable name of the DataFrame on the last line of a cell displays it; large DataFrames are shown truncated to the first and last rows
data

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [9]:
# view the first 5 rows
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [10]:
# view the last 5 rows
data.tail()

,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [11]:
# both the tail() and head() methods have a parameter n that lets you set the number of rows to display

# select the first 10 rows of the dataset
data.head(10)

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


# Inspecting data

In [12]:
"""
info: the info method prints the number of rows in the dataframe, the number of columns, the name of each column 
along with its number of non-null values, and the data type of each column.
"""
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
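
The individual pieces that `info()` reports can also be retrieved as values, which is handy when you want to use them in code. A small sketch on a made-up DataFrame with one missing value:

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value
df = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 4.6],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
print(df.count())       # non-null values per column
print(df.isna().sum())  # missing values per column
```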


In [13]:
# the describe method returns useful statistical details about the numeric columns in the dataframe
data.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
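
Each statistic in the `describe()` table is also available as its own method, and non-numeric columns can be included on request. A minimal sketch on a made-up four-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "length": [5.1, 4.9, 4.7, 4.6],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Individual statistics for the numeric column
print(df["length"].mean())
print(df["length"].median())
print(df["length"].min(), df["length"].max())

# describe() skips non-numeric columns by default; include them explicitly
print(df.describe(include="all"))
```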


# Accessing specific records


In [19]:
import numpy as np
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [20]:
# like a Python list, a pandas DataFrame can be sliced with the same notation; integer slices select rows by position

# get the first 5 rows
df[0:5]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
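
Slicing by position and slicing by label behave slightly differently at the end of the range. The sketch below uses a made-up DataFrame with the same `A` to `E` row labels to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [1, 2, 3, 4, 5], "X": [10, 20, 30, 40, 50]},
    index=["A", "B", "C", "D", "E"],
)

# Integer slices select rows by position; the end position is excluded
first_two = df[0:2]
print(first_two)

# Label slices include both endpoints
a_to_c = df["A":"C"]
print(a_to_c)

# iloc and loc make the intent explicit
print(df.iloc[0:2])
print(df.loc["A":"C"])
```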


In [22]:
# Get all the values in the column 'W':
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [26]:
# Get the values in columns 'W' and 'Z' by passing a list of column names: 
df[['W','Z']]

,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [50]:
# Create a new column 'new' by adding the values in columns 'W' and 'Y':
df['new'] = df['W'] + df['Y']
df

,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [51]:
# Removing columns. Axis - Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
# axis = 1 when we want to drop columns
# drop() returns a new DataFrame and leaves df unchanged
df.drop('new',axis=1)

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [52]:
# Drop rows. When dropping rows axis = 0
# Delete row 'E' (the 'new' column is still present because the result of the previous drop() was not assigned back to df)
df.drop('E',axis=0)

,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
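
Because `drop()` returns a new DataFrame, the change is lost unless the result is stored. A minimal sketch with a made-up three-row DataFrame, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [1, 2, 3], "Y": [4, 5, 6]},
    index=["A", "B", "C"],
)
df["new"] = df["W"] + df["Y"]

# drop() returns a copy; df still has the 'new' column afterwards
dropped = df.drop("new", axis=1)
print("new" in df.columns)
print("new" in dropped.columns)

# To keep the change, assign the result back
df = df.drop(columns="new")   # same as drop('new', axis=1)
df = df.drop(index="C")       # same as drop('C', axis=0)
print(df)
```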


### Selecting rows

In [53]:
# Selecting rows based on label
df.loc['A']

W      2.706850
X      0.628133
Y      0.907969
Z      0.503826
new    3.614819
Name: A, dtype: float64

In [55]:
# Select row based on position, 0 is the first row. 
df.iloc[0]

W      2.706850
X      0.628133
Y      0.907969
Z      0.503826
new    3.614819
Name: A, dtype: float64

Selecting subset of rows and columns:

In [61]:
# Select value in row 'B' and column 'Y'
df.loc['B','Y']

-0.8480769834036315

In [66]:
# Pass a list of row labels ('A' and 'B') and a list of column labels ('W' and 'Y'). 
df.loc[['A','B'],['W','Y']]

,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
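
`loc` works with labels and `iloc` works with integer positions, so the same selection can usually be written both ways. A small sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [1, 2, 3], "Y": [4, 5, 6]},
    index=["A", "B", "C"],
)

# Same cell, addressed by label and by position
by_label = df.loc["B", "Y"]
by_position = df.iloc[1, 1]
print(by_label, by_position)

# Same sub-table, addressed by label and by position
sub_label = df.loc[["A", "B"], ["W", "Y"]]
sub_position = df.iloc[0:2, [0, 1]]
print(sub_label.equals(sub_position))

# loc can also be used to assign a value
df.loc["C", "W"] = 30
print(df)
```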


## Conditional selection

In [67]:
df

,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [68]:
# True where a value is greater than 0, False otherwise
df>0

,W,X,Y,Z,new
A,True,True,True,True,True
B,True,False,False,True,False
C,False,True,True,False,False
D,True,False,False,True,False
E,True,True,True,True,True


In [69]:
# Keep the values greater than 0; all other values become NaN (shown as empty fields below)
df[df>0]

,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,,,0.605965,
C,,0.740122,0.528813,,
D,0.188695,,,0.955057,
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [70]:
# Keep only the rows where column 'W' is greater than 0
df[df['W']>0]

,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [71]:
# Column 'Y' for the rows where 'W' is greater than 0
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [72]:
# Columns 'Y' and 'X' for the rows where 'W' is greater than 0
df[df['W']>0][['Y','X']]

,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757
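
Conditions can also be combined, and `loc` can apply a row condition and a column selection in a single step. The sketch below uses a made-up four-row DataFrame; the variable names are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [2.7, 0.6, -2.0, 0.2], "Y": [0.9, -0.8, 0.5, -0.9]},
    index=["A", "B", "C", "D"],
)

# Combine conditions with & (and) or | (or); each condition needs parentheses
both_positive = df[(df["W"] > 0) & (df["Y"] > 0)]
print(both_positive)

either_positive = df[(df["W"] > 0) | (df["Y"] > 0)]
print(either_positive)

# loc takes the row condition and the column in one step
y_where_w_positive = df.loc[df["W"] > 0, "Y"]
print(y_where_w_positive)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.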


## Sources: 

Beginner's Tutorial on the Pandas Python Library
https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/

Dealing with Rows and Columns in Pandas DataFrame
https://www.geeksforgeeks.org/dealing-with-rows-and-columns-in-pandas-dataframe/


The Mastery of Pandas - I
https://medium.com/swlh/the-mastery-of-pandas-i-50156db42125
