# Pandas

<b>Pandas</b> is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.

Its name is derived from <b>"panel data"</b>, an econometrics term for data sets that include observations over multiple time periods for the same individuals. It stores data in a tabular format consisting of rows and columns.


## Importing pandas

In [1]:
import pandas as pd

## Series

A `Series` is a one-dimensional <b>object</b> similar to an array, list, or column in a table. By default, pandas assigns an integer index label (0, 1, ..., n-1) to each item in the Series.

In [2]:
s = pd.Series(['Apple', 'Banana', 43, 65.6, 'Final'])
print(s)
print("The first element is", s[0])

0     Apple
1    Banana
2        43
3      65.6
4     Final
dtype: object
The first element is Apple


You can set your own index labels by passing them to the `index` argument.

In [3]:
s = pd.Series(['Apple', 'Banana', 'Guava', 'Tomato', 'Potato'], index=['1', '2', '3', '4', '5'])
s

1     Apple
2    Banana
3     Guava
4    Tomato
5    Potato
dtype: object
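
Once the index holds custom labels, it helps to be explicit about whether you are looking up by label or by position. The sketch below rebuilds the same Series under the hypothetical name `fruits` and uses `.loc` for labels and `.iloc` for zero-based positions.

```python
import pandas as pd

# Same data as above, with the string labels '1'..'5' as the index.
# The name `fruits` is only for illustration.
fruits = pd.Series(['Apple', 'Banana', 'Guava', 'Tomato', 'Potato'],
                   index=['1', '2', '3', '4', '5'])

by_label = fruits.loc['1']      # look up by index label
by_position = fruits.iloc[0]    # look up by zero-based position

print(by_label, by_position)
```

Both lookups return the same item here, because the label `'1'` happens to sit at position 0.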

The `Series` constructor can convert a dictionary as well, using the keys of the dictionary as its index.

In [4]:
dictionary = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(dictionary)
cities

Austin            450.0
Boston              NaN
Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
dtype: float64
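
Notice that the `None` for Boston became `NaN`, and the whole Series became `float64`. A minimal sketch of how you might detect and drop such missing entries, using the same dictionary as above:

```python
import pandas as pd

cities = pd.Series({'Chicago': 1000, 'New York': 1300, 'Portland': 900,
                    'San Francisco': 1100, 'Austin': 450, 'Boston': None})

missing = cities.isna()   # True where the value is missing
known = cities.dropna()   # a copy without the missing entries

print(missing['Boston'])
print(len(known))
```

`dropna()` returns a new Series, so `cities` itself still contains Boston.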

You can use the index labels to select specific items from the Series.

In [5]:
cities['Chicago']

1000.0

In [6]:
#Select several labels at once by passing a list
cities[['Chicago', 'Portland', 'San Francisco']]

Chicago          1000.0
Portland          900.0
San Francisco    1100.0
dtype: float64

Or you can use boolean indexing for selection.

In [9]:
cities[cities < 1000]

Austin      450.0
Portland    900.0
dtype: float64
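
Boolean indexing works in two steps: the comparison produces a Series of `True`/`False` flags, and that mask then picks the matching rows. The sketch below shows the mask on its own and combines two conditions with `&`. The name `mid_sized` and the thresholds are chosen only for illustration.

```python
import pandas as pd

cities = pd.Series({'Chicago': 1000, 'New York': 1300, 'Portland': 900,
                    'San Francisco': 1100, 'Austin': 450})

mask = cities < 1000    # boolean Series, one flag per city

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mid_sized = cities[(cities >= 900) & (cities <= 1100)]

print(mask)
print(mid_sized)
```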

You can also change the values in a Series on the fly.

In [10]:
# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])

Old value: 1000.0
New value: 1400.0


What if you aren't sure whether an item is in the Series? You can check using the following statement.

In [11]:
print('Seattle' in cities)
print('San Francisco' in cities)

#You can also store in a variable like this
is_seattle_in_cities = 'Seattle' in cities
print(is_seattle_in_cities)

False
True
False
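
Keep in mind that `in` checks the index labels, not the values. A short sketch of the difference, using a smaller two-city Series for illustration:

```python
import pandas as pd

cities = pd.Series({'Chicago': 1000, 'Portland': 900})

print('Chicago' in cities)        # True: 'Chicago' is an index label
print(1000 in cities)             # False: 1000 is a value, not a label
print(1000 in cities.values)      # True: searches the underlying values
print(cities.isin([900]).any())   # True: element-wise membership test
```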


## PIT STOP

Let us review what we have covered about Series so far.

In [12]:
#Create a new dictionary mapping name to height (in cm) for yourself and the person sitting next to you
#E.g.  {"Shubham": 181, "Gabriel": 144}

name_height_dict = {"Shubham": 181, "Gabriel": 144}
name_height_dict

{'Gabriel': 144, 'Shubham': 181}

In [13]:
#Now convert this dictionary into a pandas series
name_height_series = pd.Series(name_height_dict)
name_height_series

Gabriel    144
Shubham    181
dtype: int64

In [14]:
#Print the name of the person who is taller
if name_height_series['Gabriel'] > name_height_series['Shubham']:
    print("Gabriel")
else:
    print("Shubham")

Shubham


In [18]:
#Print the height of the people who are below 150cm
name_height_series[name_height_series<150]

Gabriel    144
dtype: int64

In [19]:
#Check if "Kim" is in your series
print("Kim" in name_height_series)

False


## DataFrames

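A `DataFrame` (table) is made up of two main components:

* index - the row labels. Think of it as a column that contains the id for each row. If you do not specify one, pandas assigns a default integer index (0, 1, 2, ...)
* columns - the named columns, each of which is a `Series`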
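
The components listed above can be inspected directly. This is a minimal sketch on a small made-up table, using the `index`, `columns` and `shape` attributes:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 3, 5], 'C': [2, 4, 6]})

print(df.index)     # row labels; a default RangeIndex when none is given
print(df.columns)   # column labels
print(df.shape)     # (number of rows, number of columns)
```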


In [20]:
#DataFrame({col1: {row1: value11, row2: value12},
#           col2: {row1: value21, row2: value22}})
df = pd.DataFrame({
        'A': {0: 'a', 1: 'b', 2: 'c'},
        'B': {0: 1, 1: 3, 2: 5},
        'C': {0: 2, 1: 4, 2: 6}})
df

Unnamed: 0,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


In [21]:
#Calculate mean per column (numeric columns only; column A holds strings)
df.mean(numeric_only=True) #Same as df.mean(axis=0, numeric_only=True)

B    3.0
C    4.0
dtype: float64

In [22]:
#Calculate mean per row (numeric columns only)
df.mean(axis=1, numeric_only=True)

0    1.5
1    3.5
2    5.5
dtype: float64

In [23]:
#Get a list of indices in the dataframe
df.index.tolist()

[0, 1, 2]

You can easily import data from an Excel file into a Jupyter notebook using `read_excel`.

In [24]:
df = pd.read_excel('C:/Users/smart/OneDrive - Singapore Management University/SMU/BIA/Curriculum/Workshop-Github/PythonWorkshop/resources/enrollment.xlsx')
df

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,468
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,404
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,126
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,180
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,68
5,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Male,106
6,2017,School of Business & Accountancy,Full-time,Diploma in Business Studies,Female,492
7,2017,School of Business & Accountancy,Full-time,Diploma in Business Studies,Male,403
8,2017,School of Business & Accountancy,Full-time,Diploma in International Business,Female,68
9,2017,School of Business & Accountancy,Full-time,Diploma in International Business,Male,61


## df.head()

If you only want to see the first few rows rather than the whole table, use `df.head()`.

In [25]:
df.head(3)

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,468
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,404
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,126


With `read_table`, you can also create DataFrames from URLs.

In [26]:
url = 'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'

# fetch the text from the URL and read it into a DataFrame
df_url = pd.read_table(url, sep='\t')
df_url.head(3)

Unnamed: 0,rank,sandwich,restaurant,description,price,address,city,phone,website,full_address,formatted_address,lat,lng
0,1,BLT,Old Oak Tap,The B is applewood smoked&mdash;nice and snapp...,$10,2109 W. Chicago Ave.,Chicago,773-772-0406,theoldoaktap.com,"2109 W. Chicago Ave., Chicago","2109 West Chicago Avenue, Chicago, IL 60622, USA",41.895734,-87.67996
1,2,Fried Bologna,Au Cheval,Thought your bologna-eating days had retired w...,$9,800 W. Randolph St.,Chicago,312-929-4580,aucheval.tumblr.com,"800 W. Randolph St., Chicago","800 West Randolph Street, Chicago, IL 60607, USA",41.884672,-87.647754
2,3,Woodland Mushroom,Xoco,Leave it to Rick Bayless and crew to come up w...,$9.50.,445 N. Clark St.,Chicago,312-334-3688,rickbayless.com,"445 N. Clark St., Chicago","445 North Clark Street, Chicago, IL 60654, USA",41.890602,-87.630925


If the data is in a CSV file, you can import it using `read_csv()`.

In [27]:
# You will be using this data in your exercise later on...
sal = pd.read_csv("Salaries.csv")

In [28]:
#If you don't specify any parameter, you will get 5 rows by default
sal.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
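
`read_csv()` also accepts any file-like object, which is handy for trying out its options without a file on disk. The sketch below uses made-up CSV text (the names and values are not from `Salaries.csv`) and the `usecols` parameter to load only some of the columns.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """Id,EmployeeName,BasePay,Year
1,ALICE EXAMPLE,1000.5,2011
2,BOB EXAMPLE,,2011
"""

# usecols keeps only the listed columns; empty fields are read as NaN.
sample = pd.read_csv(io.StringIO(csv_text),
                     usecols=['Id', 'EmployeeName', 'BasePay'])
print(sample)
print(sample.dtypes)
```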


## Common Operations on DataFrames

**Accessing a column**

In [29]:
# table_variable['column_name']

df['year'].head(5)

0    2017
1    2017
2    2017
3    2017
4    2017
Name: year, dtype: int64

**Modifying a Column**

In this example, we are setting the `no_of_students` to a fixed value of 5

In [30]:
new_df= df.copy()

new_df['no_of_students'] = 5

new_df.head(5)


Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,5
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,5
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,5
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,5
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,5


We can also compute a column from existing columns. In Excel, you might give cell `C1` the formula

```
= A1 + B1
```

In pandas you would have

```
df['C'] = df['A'] + df['B']
```

Here the formula applies to the entire column at once.

In [32]:
new_df = df.copy()

# In this example, you are adding 1 to the existing no_of_students column

print(df.head()[['course_name', 'no_of_students']])
new_df['no_of_students'] = new_df['no_of_students'] + 1
print("\n")
print(new_df.head()[['course_name', 'no_of_students']])

                                  course_name  no_of_students
0                      Diploma in Accountancy             468
1                      Diploma in Accountancy             404
2     Diploma in Banking & Financial Services             126
3     Diploma in Banking & Financial Services             180
4  Diploma in Business Information Technology              68


                                  course_name  no_of_students
0                      Diploma in Accountancy             469
1                      Diploma in Accountancy             405
2     Diploma in Banking & Financial Services             127
3     Diploma in Banking & Financial Services             181
4  Diploma in Business Information Technology              69
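
The same idea works with two columns, as in the Excel formula above. The sketch below assumes a hypothetical layout with one `female` and one `male` column per course (the real enrollment file stores gender in rows instead) and derives two new columns from them.

```python
import pandas as pd

# Hypothetical wide layout; the counts are borrowed from the enrollment table.
enrol = pd.DataFrame({'female': [468, 126, 68], 'male': [404, 180, 106]},
                     index=['Accountancy', 'Banking', 'Business IT'])

enrol['total'] = enrol['female'] + enrol['male']           # like =A1+B1 filled down
enrol['female_share'] = enrol['female'] / enrol['total']   # element-wise division

print(enrol)
```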


**Using .pivot()**

`.pivot()` reshapes data (produces a “pivot” table) based on column values. It uses the unique values from the chosen index and columns to form the axes of the resulting DataFrame.

In [37]:
df_pivot = pd.DataFrame({'gender': ['girl','girl','girl','guy','guy','guy'],
                       'class': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'age': [10, 12, 13, 14, 15, 16],
                      'test_score': [7,6,4,3,2,1]})
df_pivot

Unnamed: 0,age,class,gender,test_score
0,10,A,girl,7
1,12,B,girl,6
2,13,C,girl,4
3,14,A,guy,3
4,15,B,guy,2
5,16,C,guy,1


In [41]:
df_pivot.pivot(index='gender', columns='class')['test_score']

class,A,B,C
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
girl,7,6,4
guy,3,2,1


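Please note: `.pivot()` raises an error when two rows share the same index/column combination. In the example below, two rows have gender `girl` and class `A`.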

In [42]:
df_pivot = pd.DataFrame({'gender': ['girl','girl','girl','guy','guy','guy'],
                       'class': ['A', 'A', 'C', 'A', 'B', 'C'],
                       'age': [10, 12, 13, 14, 15, 16],
                      'test_score': [7,6,4,3,2,1]})
df_pivot.pivot(index='gender', columns='class')['test_score']

ValueError: Index contains duplicate entries, cannot reshape
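
One possible way around this error is `pivot_table`, which aggregates the duplicates instead of failing. The sketch below averages them with `aggfunc='mean'`. Whether an average is the right summary depends on your data, so treat that choice as an assumption.

```python
import pandas as pd

df_dup = pd.DataFrame({'gender': ['girl', 'girl', 'girl', 'guy', 'guy', 'guy'],
                       'class': ['A', 'A', 'C', 'A', 'B', 'C'],
                       'test_score': [7, 6, 4, 3, 2, 1]})

# The two (girl, A) rows are averaged into one cell; empty cells become NaN.
scores = df_dup.pivot_table(index='gender', columns='class',
                            values='test_score', aggfunc='mean')
print(scores)
```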

## df.merge()

In [43]:
df_1 = pd.DataFrame({'gender': ['girl','girl','','guy','guy','guy'],
                       'class': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'age': [10, 12, 13, 14, 15, 16],
                      'test_score': [7,6,4,3,2,1]})
df_1

Unnamed: 0,age,class,gender,test_score
0,10,A,girl,7
1,12,B,girl,6
2,13,C,,4
3,14,A,guy,3
4,15,B,guy,2
5,16,C,guy,1


In [44]:
df_2 = pd.DataFrame({'gender': ['girl','guy'],
                       'hair_length': [ 'long', 'short']})
df_2

Unnamed: 0,gender,hair_length
0,girl,long
1,guy,short


<img src="C:\Users\smart\OneDrive - Singapore Management University\SMU\BIA\Curriculum\Workshop-Github\PythonWorkshop\resources\merge-image.png"/>

In [45]:
df_1.merge(df_2, on='gender', how='outer')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,13,C,,4,
3,14,A,guy,3,short
4,15,B,guy,2,short
5,16,C,guy,1,short


In [46]:
df_1.merge(df_2, on='gender', how='inner')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,14,A,guy,3,short
3,15,B,guy,2,short
4,16,C,guy,1,short


In [47]:
df_1.merge(df_2, on='gender', how='left')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,13,C,,4,
3,14,A,guy,3,short
4,15,B,guy,2,short
5,16,C,guy,1,short


In [48]:
df_1.merge(df_2, on='gender', how='right')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,14,A,guy,3,short
3,15,B,guy,2,short
4,16,C,guy,1,short
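
To see why a row does or does not survive a particular `how`, you can ask `merge` to record where each row came from. The sketch below uses cut-down versions of the two tables above (named `left` and `right` for illustration) and the `indicator=True` option, which adds a `_merge` column.

```python
import pandas as pd

left = pd.DataFrame({'gender': ['girl', '', 'guy'], 'age': [10, 13, 14]})
right = pd.DataFrame({'gender': ['girl', 'guy'], 'hair_length': ['long', 'short']})

# _merge is 'both', 'left_only' or 'right_only' for each output row.
merged = left.merge(right, on='gender', how='outer', indicator=True)
print(merged[['gender', 'hair_length', '_merge']])
```

The row with the empty gender is marked `left_only`, which is why it disappears under `how='inner'` and `how='right'`.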


## df.info() 

`df.info()` reports the number of rows, the data type of each column, and the non-null count per column, which is useful for spotting missing values.

In [49]:
df.head()

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,468
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,404
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,126
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,180
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,68


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166 entries, 0 to 165
Data columns (total 6 columns):
year              166 non-null int64
school            166 non-null object
course_type       166 non-null object
course_name       166 non-null object
gender            166 non-null object
no_of_students    166 non-null int64
dtypes: int64(2), object(4)
memory usage: 7.9+ KB
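
If you only need the number of missing values per column, a more direct option is `isna().sum()`. This is a small sketch on a made-up table, since the enrollment data above has no missing values:

```python
import pandas as pd

# Made-up data with one missing value in each column.
people = pd.DataFrame({'name': ['Ann', 'Ben', None],
                       'height': [150.0, None, 144.0]})

print(people.isna().sum())   # missing values per column
print(people.dtypes)         # data type of each column
```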


## df.describe() 

Gives summary statistics for the *numeric* columns.

In [51]:
df.describe()

Unnamed: 0,year,no_of_students
count,166.0,166.0
mean,2017.0,100.301205
std,0.0,148.295139
min,2017.0,1.0
25%,2017.0,16.75
50%,2017.0,54.0
75%,2017.0,111.5
max,2017.0,1236.0
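
Text columns are left out by default. A minimal sketch, on a small made-up table, of how `include='all'` adds them to the summary:

```python
import pandas as pd

df_small = pd.DataFrame({'gender': ['Female', 'Male', 'Female'],
                         'no_of_students': [468, 404, 126]})

print(df_small.describe())                 # numeric columns only
summary = df_small.describe(include='all') # adds count/unique/top/freq for text columns
print(summary)
```

Statistics that do not apply to a column, such as `mean` for `gender`, show up as `NaN`.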


## df.set_index() and df.reset_index()

In [64]:
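#Use the `inplace` parameter to modify the DataFrame itself
#instead of returning a new DataFrame.
#Note: running this cell a second time raises KeyError: 'course_name',
#because the column has already been moved into the index.
df.set_index('course_name', inplace=True)
df.head()

Unnamed: 0_level_0,year,school,course_type,gender,no_of_students
course_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68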

In [53]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,course_name,year,school,course_type,gender,no_of_students
0,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
1,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
2,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
3,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
4,Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68
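
If you would rather keep the original DataFrame untouched, both methods also return a new DataFrame when `inplace` is left out. This sketch uses a small made-up table and the hypothetical names `indexed` and `restored`:

```python
import pandas as pd

df_demo = pd.DataFrame({'course_name': ['Accountancy', 'Banking'],
                        'no_of_students': [468, 126]})

indexed = df_demo.set_index('course_name')   # returns a new DataFrame
restored = indexed.reset_index()             # the index becomes a column again

print('course_name' in df_demo.columns)      # the original is unchanged
print(indexed.index.name)
```

Because nothing is modified in place, re-running the cell gives the same result each time.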


# Locating rows in a pandas DataFrame

## df.iloc[]

We can select rows by integer position using `df.iloc[]` (note the square brackets). Positions start at 0, so `df.iloc[[2,3,4]]` returns the 3rd, 4th, and 5th rows.

In [61]:
df.iloc[[0,1,2,3]]

Unnamed: 0,course_name,year,school,course_type,gender,no_of_students
0,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
1,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
2,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
3,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
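
`iloc` also accepts a single position, a slice, or a row and column selection together. A short sketch on a made-up table:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [10, 20, 30, 40, 50], 'y': list('abcde')})

print(df_demo.iloc[2])           # third row, returned as a Series
print(df_demo.iloc[1:3])         # rows at positions 1 and 2 (the end is excluded)
print(df_demo.iloc[[0, -1], 0])  # first and last row, first column only
```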


## df.loc[]

We can select rows by `label` using `df.loc[]`. For example, `df.loc[[2,3,4]]` returns the rows whose index labels are 2, 3, and 4.

In [56]:
df.set_index('course_name', inplace=True)

In [57]:
df.head()

Unnamed: 0_level_0,year,school,course_type,gender,no_of_students
course_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68


In [63]:
#Returns the rows with index = Diploma in Accountancy
df.loc['Diploma in Accountancy']

Unnamed: 0_level_0,year,school,course_type,gender,no_of_students
course_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404


In [65]:
#Returns columns course_type, gender of rows with index = Diploma in Accountancy
df.loc[['Diploma in Accountancy'], ['course_type', 'gender']]

Unnamed: 0_level_0,course_type,gender
course_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Diploma in Accountancy,Full-time,Female
Diploma in Accountancy,Full-time,Male


In [66]:
#Returns the first five rows with gender == 'Male'
df.loc[df['gender']=='Male'].head()

Unnamed: 0_level_0,year,school,course_type,gender,no_of_students
course_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Male,106
Diploma in Business Studies,2017,School of Business & Accountancy,Full-time,Male,403
Diploma in International Business,2017,School of Business & Accountancy,Full-time,Male,61
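
A condition and a column list can be combined in a single `loc` call. The sketch below uses a cut-down, made-up version of the table and an arbitrary threshold of 100 students:

```python
import pandas as pd

df_demo = pd.DataFrame({'gender': ['Female', 'Male', 'Male'],
                        'no_of_students': [468, 404, 61]},
                       index=['Accountancy', 'Accountancy', 'International Business'])

# Rows: male cohorts with more than 100 students. Columns: only no_of_students.
large_male = df_demo.loc[(df_demo['gender'] == 'Male') & (df_demo['no_of_students'] > 100),
                         ['no_of_students']]
print(large_male)
```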


In [67]:
df = df.reset_index()
df.head()

Unnamed: 0,course_name,year,school,course_type,gender,no_of_students
0,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
1,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
2,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
3,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
4,Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68


Exercise: How many students are enrolled in the Diploma in Banking & Financial Services?

In [None]:
df[df['course_name']=='Diploma in Banking & Financial Services']['no_of_students'].sum()

# Saving a DataFrame

## Saving to a csv file

In [None]:
df.to_csv('updated_csv.csv', encoding='utf-8')
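
By default, `to_csv` also writes the row index as the first column. If you do not want that, pass `index=False`. The sketch below writes to an in-memory buffer instead of a file so that the result can be printed directly.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'course_name': ['Accountancy'], 'no_of_students': [468]})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)   # omit the row index from the output
print(buffer.getvalue())
```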

## Saving to an Excel Workbook

In [None]:
#Using ExcelWriter as a context manager saves and closes the workbook on exit
with pd.ExcelWriter('updated_xlsx.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
    df_1.to_excel(writer, sheet_name='Sheet2')

## Saving to a Python dictionary

In [None]:
dictionary = df.to_dict()
dictionary
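
The default result is nested as `{column: {index: value}}`. If one dictionary per row suits you better, the `orient` parameter changes the layout. A short sketch on a made-up table:

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['Shubham', 'Gabriel'], 'height': [181, 144]})

by_column = df_demo.to_dict()                 # {column: {index: value}}
by_row = df_demo.to_dict(orient='records')    # one dict per row

print(by_column)
print(by_row)
```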