# Pandas

In [3]:
import pandas as pd

## Series

A `Series` is a one-dimensional object similar to an array, list, or column in a table. Each item gets an index label; by default the labels are the integers 0 through N - 1, where N is the number of items.

In [10]:
s = pd.Series(['Apple', 'Banana', 43, 65.6, 'Final'])
s

0     Apple
1    Banana
2        43
3      65.6
4     Final
dtype: object

The `Series` constructor can also convert a dictionary, using the dictionary's keys as the index.

In [11]:
dictionary = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(dictionary)
cities

Austin            450.0
Boston              NaN
Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
dtype: float64
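The `None` given for Boston is stored as `NaN`, which is why the Series above has dtype `float64`. As a minimal sketch, assuming a smaller subset of the same cities, missing entries can be located with `isna()` and removed with `dropna()`:

```python
import pandas as pd

# Illustrative subset of the cities data above; Boston has no value
cities = pd.Series({'Chicago': 1000, 'Austin': 450, 'Boston': None})

# isna() returns a boolean Series that marks the missing entries
missing = cities.isna()

# dropna() returns a new Series without the missing entries
known = cities.dropna()

print(missing)
print(known)
```

Neither call changes `cities` itself; both return new objects.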

You can use the index to select specific items from the Series.

In [12]:
cities['Chicago']


1000.0

In [13]:
#Use multiple items
cities[['Chicago', 'Portland', 'San Francisco']]


Chicago          1000.0
Portland          900.0
San Francisco    1100.0
dtype: float64

Or you can use boolean indexing for selection.

In [14]:
cities[cities < 1000]


Austin      450.0
Portland    900.0
dtype: float64
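Boolean indexing works because the comparison itself returns a Series of `True`/`False` values, which is then used as a filter. The sketch below, using a few of the same cities, shows the mask on its own and one way to combine two conditions; the thresholds are arbitrary:

```python
import pandas as pd

cities = pd.Series({'Chicago': 1000, 'New York': 1300,
                    'Portland': 900, 'Austin': 450})

# The comparison produces a boolean Series (the mask)
mask = cities < 1000
print(mask)

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
mid_sized = cities[(cities > 500) & (cities < 1200)]
print(mid_sized)
```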

You can also change the values in a Series on the fly.

In [15]:
# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])

Old value: 1000.0
New value: 1400.0
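Assignment also works with a boolean mask, which updates every matching item at once. A small sketch, where the replacement value 750 is arbitrary:

```python
import pandas as pd

cities = pd.Series({'Chicago': 1400, 'Portland': 900, 'Austin': 450})

# Every entry below 1000 is overwritten with the illustrative value 750
cities[cities < 1000] = 750
print(cities)
```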


What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.

In [16]:
print('Seattle' in cities)
print('San Francisco' in cities)

False
True
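Note that `in` tests the index labels of a Series, not its values. The following sketch contrasts the two and also shows `get()`, which returns a default instead of raising an error when a label is missing; the default of 0 is an arbitrary choice:

```python
import pandas as pd

cities = pd.Series({'Chicago': 1400, 'Portland': 900})

label_found = 'Chicago' in cities      # checks the index labels
value_as_label = 900 in cities         # 900 is a value, not a label
value_found = 900 in cities.values     # searches the underlying values

# get() falls back to the given default when the label does not exist
seattle = cities.get('Seattle', 0)

print(label_found, value_as_label, value_found, seattle)
```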


## DataFrames

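A `DataFrame` is a table whose data is addressed by two sets of labels:

* index - the row labels. Think of it as a column that holds the id of each row. When no index is specified, pandas assigns the default integer index 0, 1, 2, ...
* columns - the column labels, one for each column of values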


In [19]:
# DataFrame({col1: {row1: value11, row2: value12},
#            col2: {row1: value21, row2: value22}})
df = pd.DataFrame({
        'A': {0: 'a', 1: 'b', 2: 'c'},
        'B': {0: 1, 1: 3, 2: 5},
        'C': {0: 2, 1: 4, 2: 6}})
df

Unnamed: 0,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


In [20]:
df.index.tolist()


[0, 1, 2]
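Both components can be inspected directly. A minimal sketch, rebuilding the same small table from plain lists:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 3, 5], 'C': [2, 4, 6]})

print(df.index)             # row labels; a default RangeIndex when none is given
print(df.columns.tolist())  # column labels
print(df.shape)             # (number of rows, number of columns)
```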

You can import data from an Excel file into a Jupyter notebook using `read_excel`.

In [40]:
df = pd.read_excel('C:/Users/smart/OneDrive - Singapore Management University/SMU/BIA/Curriculum/Workshop-Github/PythonWorkshop/resources/enrollment.xlsx')

#Print the top 5 rows using df.head(5)
df.head(5)

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,468
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,404
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,126
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,180
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,68


With `read_table`, you can also create DataFrames from URLs.

In [22]:
url = 'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'

# fetch the text from the URL and read it into a DataFrame
df_url = pd.read_table(url, sep='\t')
df_url.head(3)

Unnamed: 0,rank,sandwich,restaurant,description,price,address,city,phone,website,full_address,formatted_address,lat,lng
0,1,BLT,Old Oak Tap,The B is applewood smoked&mdash;nice and snapp...,$10,2109 W. Chicago Ave.,Chicago,773-772-0406,theoldoaktap.com,"2109 W. Chicago Ave., Chicago","2109 West Chicago Avenue, Chicago, IL 60622, USA",41.895734,-87.67996
1,2,Fried Bologna,Au Cheval,Thought your bologna-eating days had retired w...,$9,800 W. Randolph St.,Chicago,312-929-4580,aucheval.tumblr.com,"800 W. Randolph St., Chicago","800 West Randolph Street, Chicago, IL 60607, USA",41.884672,-87.647754
2,3,Woodland Mushroom,Xoco,Leave it to Rick Bayless and crew to come up w...,$9.50.,445 N. Clark St.,Chicago,312-334-3688,rickbayless.com,"445 N. Clark St., Chicago","445 North Clark Street, Chicago, IL 60654, USA",41.890602,-87.630925
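`read_table` also accepts file-like objects, which is handy for trying it out without a network connection. The sketch below assumes a short tab-separated string, with a few values copied from the table above, standing in for the downloaded file:

```python
import io

import pandas as pd

# A short tab-separated string standing in for the remote file
tsv_text = "rank\tsandwich\tprice\n1\tBLT\t$10\n2\tFried Bologna\t$9\n"

# StringIO wraps the string so that it can be read like a file
df_tsv = pd.read_table(io.StringIO(tsv_text), sep='\t')
print(df_tsv)
```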


## Common Operations on DataFrames

**Accessing a Column**

In [27]:
# table_variable['column_name']

df['year'].head(5)

0    2017
1    2017
2    2017
3    2017
4    2017
Name: year, dtype: int64
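A single column name returns a `Series`, while a list of names returns a `DataFrame`. A small sketch, assuming a two-row excerpt of the enrollment data shown earlier:

```python
import pandas as pd

# Two rows copied from the enrollment table above
df = pd.DataFrame({
    'year': [2017, 2017],
    'course_name': ['Diploma in Accountancy', 'Diploma in Accountancy'],
    'gender': ['Female', 'Male'],
    'no_of_students': [468, 404],
})

one_column = df['gender']                    # a Series
several = df[['gender', 'no_of_students']]   # a DataFrame (note the double brackets)

print(type(one_column).__name__, type(several).__name__)
```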

**Modifying a Column**

In this example, we set every value in the `no_of_students` column to 5.

In [28]:
new_df= df.copy()

new_df['no_of_students'] = 5

new_df.head(5)

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,5
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,5
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,5
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,5
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,5


We can also compute a column from existing columns. In Excel, you might give cell `C1` the formula

```
= A1 + B1
```

In pandas you would have

```
df['C'] = df['A'] + df['B']
```

The formula is applied to the entire column at once.

In [29]:
new_df = df.copy()

# In this example, you are adding 1 to the existing no_of_students column

new_df['no_of_students'] = new_df['no_of_students'] + 1

new_df.head(5)

Unnamed: 0,year,school,course_type,course_name,gender,no_of_students
0,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Female,469
1,2017,School of Business & Accountancy,Full-time,Diploma in Accountancy,Male,405
2,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Female,127
3,2017,School of Business & Accountancy,Full-time,Diploma in Banking & Financial Services,Male,181
4,2017,School of Business & Accountancy,Full-time,Diploma in Business Information Technology,Female,69
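The Excel analogy of `C1 = A1 + B1` can be sketched with two small numeric columns; the column names and values here are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})

# Row by row: C = A + B, like filling the formula down column C in Excel
df['C'] = df['A'] + df['B']
print(df)
```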


**Using .pivot()**

Reshape data (produce a “pivot” table) based on column values. Uses unique values from index / columns to form axes of the resulting DataFrame.

In [30]:
# Use a separate name so the enrollment data in `df` is not overwritten
pivot_df = pd.DataFrame({'gender': ['girl', 'girl', 'girl', 'guy', 'guy', 'guy'],
                         'class': ['A', 'B', 'C', 'A', 'B', 'C'],
                         'age': [10, 12, 13, 14, 15, 16],
                         'test_score': [7, 6, 4, 3, 2, 1]})
pivot_df

Unnamed: 0,age,class,gender,test_score
0,10,A,girl,7
1,12,B,girl,6
2,13,C,girl,4
3,14,A,guy,3
4,15,B,guy,2
5,16,C,guy,1


In [31]:
pivot_df.pivot(index='gender', columns='class')['test_score']

class,A,B,C
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
girl,7,6,4
guy,3,2,1
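`pivot()` needs every index/column pair to be unique. When pairs repeat, `pivot_table()` is one alternative, since it aggregates the duplicates. A minimal sketch, assuming made-up scores in which the pair (`girl`, `A`) appears twice:

```python
import pandas as pd

# Hypothetical scores with a repeated (gender, class) pair
scores = pd.DataFrame({'gender': ['girl', 'girl', 'guy'],
                       'class': ['A', 'A', 'A'],
                       'test_score': [7, 5, 3]})

# pivot() would raise a ValueError here because ('girl', 'A') appears twice;
# pivot_table() aggregates the duplicates instead, here with the mean
table = scores.pivot_table(index='gender', columns='class',
                           values='test_score', aggfunc='mean')
print(table)
```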


**Using .merge()**

In [32]:
import numpy as np

# numpy is imported only for np.nan, which marks a missing value below
df_1 = pd.DataFrame({'gender': ['girl','girl',np.nan,'guy','guy','guy'],
                       'class': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'age': [10, 12, 13, 14, 15, 16],
                      'test_score': [7,6,4,3,2,1]})
df_1

Unnamed: 0,age,class,gender,test_score
0,10,A,girl,7
1,12,B,girl,6
2,13,C,,4
3,14,A,guy,3
4,15,B,guy,2
5,16,C,guy,1


In [33]:
df_2 = pd.DataFrame({'gender': ['girl','guy'],
                       'hair_length': [ 'long', 'short']})
df_2

Unnamed: 0,gender,hair_length
0,girl,long
1,guy,short


<img src="resources/merge-image.jpg"/>

In [34]:
df_1.merge(df_2, on='gender', how='outer')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,13,C,,4,
3,14,A,guy,3,short
4,15,B,guy,2,short
5,16,C,guy,1,short


In [35]:
df_1.merge(df_2, on='gender', how='inner')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,14,A,guy,3,short
3,15,B,guy,2,short
4,16,C,guy,1,short


In [36]:
df_1.merge(df_2, on='gender', how='left')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,13,C,,4,
3,14,A,guy,3,short
4,15,B,guy,2,short
5,16,C,guy,1,short


In [37]:
df_1.merge(df_2, on='gender', how='right')

Unnamed: 0,age,class,gender,test_score,hair_length
0,10,A,girl,7,long
1,12,B,girl,6,long
2,14,A,guy,3,short
3,15,B,guy,2,short
4,16,C,guy,1,short
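To see which rows found a match, `merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses simplified, made-up tables in which the key `unknown` exists only on the left:

```python
import pandas as pd

left = pd.DataFrame({'gender': ['girl', 'guy', 'unknown'],
                     'age': [10, 14, 13]})
right = pd.DataFrame({'gender': ['girl', 'guy'],
                      'hair_length': ['long', 'short']})

# A left join keeps every row of `left`; unmatched rows get NaN for hair_length
merged = left.merge(right, on='gender', how='left', indicator=True)
print(merged)

# An inner join keeps only the rows whose key appears in both tables
inner = left.merge(right, on='gender', how='inner')
print(inner)
```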


## df.info() 

Gives details about the number of rows and the data type of each column, and is useful for spotting missing values.

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166 entries, 0 to 165
Data columns (total 6 columns):
year              166 non-null int64
school            166 non-null object
course_type       166 non-null object
course_name       166 non-null object
gender            166 non-null object
no_of_students    166 non-null int64
dtypes: int64(2), object(4)
memory usage: 7.9+ KB
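The non-null counts from `info()` can also be turned into an explicit count of missing values per column. A small sketch, assuming a made-up table with one gap in each column:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'gender': ['girl', np.nan, 'guy'],
                           'test_score': [7, 4, np.nan]})

# isna() marks each missing cell; sum() counts the marks per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)
```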


## df.describe() 

Gives summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the *numeric* variables

In [45]:
df.describe()

Unnamed: 0,year,no_of_students
count,166.0,166.0
mean,2017.0,100.301205
std,0.0,148.295139
min,2017.0,1.0
25%,2017.0,16.75
50%,2017.0,54.0
75%,2017.0,111.5
max,2017.0,1236.0
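By default, text columns are left out of `describe()`. Passing `include='all'` is one way to summarise them as well; it adds rows such as `unique`, `top`, and `freq`. A sketch, assuming three rows taken from the enrollment data:

```python
import pandas as pd

df_small = pd.DataFrame({'gender': ['Female', 'Male', 'Female'],
                         'no_of_students': [468, 404, 126]})

# include='all' summarises text and numeric columns in one table
summary = df_small.describe(include='all')
print(summary)
```

Cells that do not apply to a column, such as the mean of `gender`, are shown as `NaN`.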


## df.set_index() and df.reset_index()

In [48]:
#Use the `inplace` parameter to change the dataframe and store it in itself
#instead of creating a new DF
df.set_index('course_name', inplace=True)
df.head()

course_name,year,school,course_type,gender,no_of_students
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68


In [50]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,course_name,year,school,course_type,gender,no_of_students
0,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
1,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
2,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
3,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
4,Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68
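Without `inplace=True`, both methods leave the original DataFrame untouched and return a new one. A minimal sketch, assuming a two-column excerpt of the enrollment data:

```python
import pandas as pd

df = pd.DataFrame({
    'course_name': ['Diploma in Accountancy',
                    'Diploma in Banking & Financial Services'],
    'no_of_students': [468, 126],
})

indexed = df.set_index('course_name')   # new DataFrame; df is unchanged
restored = indexed.reset_index()        # course_name becomes a column again

print(indexed)
print(restored)
```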


# Locating Rows in a Pandas DataFrame

## df.iloc[]

We can select rows by position with `df.iloc[]`. Positions start at 0, so `df.iloc[[1, 2, 3]]` returns the 2nd, 3rd, and 4th rows.

In [65]:
df.iloc[[1,2,3]]

Unnamed: 0,course_name,year,school,course_type,gender,no_of_students
1,Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
2,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
3,Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
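Besides lists of positions, `iloc` accepts slices, negative positions, and a row/column pair. A small sketch on a made-up five-row table:

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcde'), 'B': [1, 3, 5, 7, 9]})

rows = df.iloc[1:4]     # positions 1, 2 and 3; the end of the slice is excluded
cell = df.iloc[0, 1]    # first row, second column
last = df.iloc[-1]      # last row, returned as a Series

print(rows)
print(cell)
print(last)
```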


## df.loc[]

We can select rows by `label` with `df.loc[]`. For example, `df.loc[[2, 3, 4]]` returns the rows whose index labels are 2, 3, and 4.

In [71]:
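# Make course_name the index so that rows can be selected by label.
# Running this cell a second time raises KeyError: 'course_name',
# because course_name is then already the index and no longer a column.
df.set_index('course_name', inplace=True)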

In [72]:
df.head()

course_name,year,school,course_type,gender,no_of_students
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Female,126
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Female,68


In [73]:
df.loc['Diploma in Accountancy']

course_name,year,school,course_type,gender,no_of_students
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Female,468
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404


In [75]:
df.loc[['Diploma in Accountancy'], ['course_type', 'gender']]

course_name,course_type,gender
Diploma in Accountancy,Full-time,Female
Diploma in Accountancy,Full-time,Male


In [83]:
#boolean selecting with loc[]
df.loc[df['gender']=='Male'].head()

course_name,year,school,course_type,gender,no_of_students
Diploma in Accountancy,2017,School of Business & Accountancy,Full-time,Male,404
Diploma in Banking & Financial Services,2017,School of Business & Accountancy,Full-time,Male,180
Diploma in Business Information Technology,2017,School of Business & Accountancy,Full-time,Male,106
Diploma in Business Studies,2017,School of Business & Accountancy,Full-time,Male,403
Diploma in International Business,2017,School of Business & Accountancy,Full-time,Male,61
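`loc` can combine several boolean conditions and select columns in the same call. The sketch below assumes four rows copied from the enrollment data; the threshold of 200 students is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    'course_name': ['Diploma in Accountancy', 'Diploma in Accountancy',
                    'Diploma in Banking & Financial Services',
                    'Diploma in Banking & Financial Services'],
    'gender': ['Female', 'Male', 'Female', 'Male'],
    'no_of_students': [468, 404, 126, 180],
}).set_index('course_name')

# Rows: male enrollments above 200 students; columns: only the two listed
big_male = df.loc[(df['gender'] == 'Male') & (df['no_of_students'] > 200),
                  ['gender', 'no_of_students']]
print(big_male)
```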
