# Pandas

In [3]:
import pandas as pd

## Creating DataFrames

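A `DataFrame` (table) is made up of two main components:

* index - Think of it as a column that holds an identifier for each row. The examples here do not set one explicitly, so pandas labels the rows 0, 1, 2, ...
* columns - The named fields of the table. Each column holds one value per row.

A table with only one column is known as a `Series`.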
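
As a small sketch of these components, the snippet below builds a made-up three-row table (the names `small` and `col_b` are only for illustration) and shows that selecting a single column gives back a `Series` that shares the table's index:

```python
import pandas as pd

# Hypothetical three-row table, used only to illustrate the components.
small = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 3, 5]})

# With no index supplied, pandas labels the rows 0, 1, 2, ...
row_labels = small.index.tolist()
column_labels = small.columns.tolist()

# Selecting a single column returns a Series, not a DataFrame.
col_b = small['B']

print(row_labels)
print(column_labels)
print(type(col_b).__name__)
```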

In [8]:
# DataFrame({col1: {row1: value11, row2: value12},
#            col2: {row1: value21, row2: value22}})
df = pd.DataFrame({
        'A': {0: 'a', 1: 'b', 2: 'c'},
        'B': {0: 1, 1: 3, 2: 5},
        'C': {0: 2, 1: 4, 2: 6}})
df

Unnamed: 0,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


In [9]:
df.index.tolist()

[0, 1, 2]
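
If one of the columns already identifies each row, it can be promoted to the index. A minimal sketch, assuming a made-up `student_id` column that is not part of the workshop data:

```python
import pandas as pd

# Hypothetical table with an identifier column.
students = pd.DataFrame({
    'student_id': ['s01', 's02', 's03'],
    'score': [7, 6, 4]})

# set_index returns a new DataFrame; the original is left unchanged.
by_id = students.set_index('student_id')

# .loc looks up a value by row label and column name.
score_s02 = by_id.loc['s02', 'score']

print(by_id.index.tolist())
print(score_s02)
```

Because `set_index` returns a copy by default, `students` still has its `student_id` column and its 0, 1, 2 index afterwards.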

You can import data from an Excel file into a Jupyter notebook using `read_excel`.

In [22]:
df = pd.read_excel('C:/Users/smart/OneDrive - Singapore Management University/SMU/Hackwagon/data/enrollment.xlsx')

#Print the top 5 rows using df.head(5)
#You can use display(your_table) instead of print to get a nice table in Jupyter. 
display(df.head(5))

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,468
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,404
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,126
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,180
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,68


## Common Operations on DataFrames

**Accessing a Column**

In [23]:
# table_variable['column_name']

df['year'].head(5)

0    2017
1    2017
2    2017
3    2017
4    2017
Name: year, dtype: int64
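
The same bracket syntax also accepts a list of column names. The sketch below uses a hypothetical two-row frame named `enrol`, shaped like the enrollment data, to show the difference: a single name returns a `Series`, while a list of names returns a `DataFrame`.

```python
import pandas as pd

# Hypothetical rows shaped like the enrollment data above.
enrol = pd.DataFrame({
    'year': [2017, 2017],
    'gender': ['Female', 'Male'],
    'no_of_students': [468, 404]})

# A single column name returns a Series.
one_col = enrol['year']

# A list of column names (note the double brackets) returns a DataFrame.
two_cols = enrol[['gender', 'no_of_students']]

print(type(one_col).__name__)
print(type(two_cols).__name__)
```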

**Modifying a Column**

In this example, we set every row of the `no_of_students` column to a fixed value of 5.

In [24]:
new_df = df.copy()

new_df['no_of_students'] = 5

new_df.head(5)

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,5
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,5
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,5
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,5
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,5


We can also compute a column from existing columns. In Excel, you might give cell `C1` the formula

```
= A1 + B1
```

In pandas, you would write

```
df['C'] = df['A'] + df['B']
```

Here the formula applies to the entire column at once, rather than to a single cell.

In [26]:
new_df = df.copy()

# In this example, you are adding 1 to the existing no_of_students column

new_df['no_of_students'] = new_df['no_of_students'] + 1

new_df.head(5)

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,469
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,405
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,127
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,181
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,69
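
The cell above adds a constant to one column. As a further sketch of the `df['C'] = df['A'] + df['B']` pattern, the snippet below combines two columns row by row. The frame `counts` and its column names are assumptions made for illustration; they are not columns of the enrollment file.

```python
import pandas as pd

# Hypothetical per-course counts, one row per course.
counts = pd.DataFrame({
    'female': [468, 126],
    'male': [404, 180]})

# Each operation is applied to every row at once, like filling
# an Excel formula down a whole column.
counts['total'] = counts['female'] + counts['male']
counts['female_share'] = counts['female'] / counts['total']

print(counts)
```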


**Using .pivot()**

`.pivot()` reshapes data (produces a "pivot" table) based on column values. It uses the unique values of the chosen index and columns to form the axes of the resulting DataFrame.

In [4]:
df = pd.DataFrame({'gender': ['girl','girl','girl','guy','guy','guy'],
                       'class': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'age': [10, 12, 13, 14, 15, 16],
                      'test_score': [7,6,4,3,2,1]})
df

Unnamed: 0,age,class,gender,test_score
0,10,A,girl,7
1,12,B,girl,6
2,13,C,girl,4
3,14,A,guy,3
4,15,B,guy,2
5,16,C,guy,1


In [7]:
df.pivot(index='gender', columns='class')['test_score']

class,A,B,C
gender,,,
girl,7,6,4
guy,3,2,1
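
`.pivot()` only works when each index/column pair occurs once. The sketch below, on a made-up frame named `scores`, passes `values=` directly instead of selecting the column afterwards, then shows that a duplicate pair raises a `ValueError` and that `pivot_table` with an aggregation function is one way around it.

```python
import pandas as pd

# Hypothetical scores, one row per (gender, class) pair.
scores = pd.DataFrame({
    'gender': ['girl', 'girl', 'guy', 'guy'],
    'class': ['A', 'B', 'A', 'B'],
    'test_score': [7, 6, 3, 2]})

# values= picks the column that fills the cells of the result.
wide = scores.pivot(index='gender', columns='class', values='test_score')

# Add a second girl in class A, so the pair (girl, A) appears twice.
extra = pd.DataFrame({'gender': ['girl'], 'class': ['A'], 'test_score': [9]})
with_dup = pd.concat([scores, extra], ignore_index=True)

# pivot cannot decide which of the two values to keep, so it raises.
try:
    with_dup.pivot(index='gender', columns='class', values='test_score')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (here: the mean).
mean_scores = with_dup.pivot_table(
    index='gender', columns='class', values='test_score', aggfunc='mean')

print(wide)
print(pivot_failed)
print(mean_scores)
```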


**Using .merge()**

In [30]:
import numpy as np

# numpy is imported only for np.nan, which marks the missing value below
df_1 = pd.DataFrame({'gender': ['girl','girl',np.nan,'guy','guy','guy'],
                       'class': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'age': [10, 12, 13, 14, 15, 16],
                      'test_score': [7,6,4,3,2,1]})
df_1

Unnamed: 0,age,class,gender,test_score
0,10,A,girl,7
1,12,B,girl,6
2,13,C,,4
3,14,A,guy,3
4,15,B,guy,2
5,16,C,guy,1


In [31]:
df_2 = pd.DataFrame({'gender': ['girl','guy'],
                       'hair_length': [ 'long', 'short']})
df_2

Unnamed: 0,gender,hair_length
0,girl,long
1,guy,short


<img src="C:/Users/smart/OneDrive - Singapore Management University/SMU/BIA/Curriculum/Workshop-Github/PythonWorkshop/resources/image.png">

In [32]:
df_1.merge(df_2, on='gender', how='outer')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,13,C,,4,
3,14,A,guy,3,short
4,15,B,guy,2,short
5,16,C,guy,1,short


In [33]:
df_1.merge(df_2, on='gender', how='inner')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,14,A,guy,3,short
3,15,B,guy,2,short
4,16,C,guy,1,short


In [34]:
df_1.merge(df_2, on='gender', how='left')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,13,C,,4,
3,14,A,guy,3,short
4,15,B,guy,2,short
5,16,C,guy,1,short


In [35]:
df_1.merge(df_2, on='gender', how='right')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,14,A,guy,3,short
3,15,B,guy,2,short
4,16,C,guy,1,short
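
To see which table each row of a merge came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two made-up frames (`left` and `right`, keyed on a hypothetical `class` column) where each side has one key the other lacks:

```python
import pandas as pd

# Hypothetical tables: class 'C' exists only on the left, 'D' only on the right.
left = pd.DataFrame({'class': ['A', 'B', 'C'], 'age': [10, 12, 13]})
right = pd.DataFrame({'class': ['A', 'B', 'D'], 'room': ['r1', 'r2', 'r4']})

# An outer merge keeps every key from both tables; _merge records the source.
merged = left.merge(right, on='class', how='outer', indicator=True)
origin = dict(zip(merged['class'], merged['_merge'].astype(str)))

# An inner merge keeps only the keys present in both tables.
inner = left.merge(right, on='class', how='inner')

print(merged)
print(inner)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would drop, which makes the indicator a quick check when a merge returns fewer rows than expected.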
