# Chapter 6: Pandas

Now that you are familiar with Python, it is time to dive deeper into what you can do with it.

<b>Pandas</b> is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.

It is one of the most widely used open source libraries for data analysis in Python.

<a href="https://pandas.pydata.org/pandas-docs/stable/api.html">Here</a> is the API for pandas. 

The first thing we need to do is import the library.
You can use a simple import statement.

## Importing pandas

In [None]:
import pandas as pd
#By saying "as pd" we give the library the short name pd, so we can write pd instead of pandas in our code
#You can choose a name other than "pd", but pd is the common convention

## Series

A `Series` is a one-dimensional <b>object</b> similar to an array, list, or column in a table. By default, it assigns an integer index label (0, 1, ..., n-1)
to each item in the Series.

In [None]:
s = pd.Series(['Apple', 'Banana', 43, 65.6, 'Final'])
print(s)
print("The first element is", s[0])

In [None]:
test_list = ["2", 4334]
print(type(test_list))

<b>Change index in series</b>

In [None]:
s = pd.Series(['Apple', 'Banana', 'Guava', 'Tomato', 'Potato'], index=['1', '2', '3', '4', '5'])
s

In [None]:
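#s[1] would be ambiguous here because the index labels are strings, not integers
print(s.iloc[1]) #second item, selected by position
print(s['1'])    #item whose index label is '1'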
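
The two lookups above return different items because one goes by position and the other by label. Here is a minimal sketch of that distinction, using `.loc` for labels and `.iloc` for positions. The Series `fruits` is made up for illustration.

```python
import pandas as pd

# Hypothetical Series whose index labels are strings that look like numbers
fruits = pd.Series(['Apple', 'Banana', 'Guava'], index=['1', '2', '3'])

by_label = fruits.loc['1']      # the item whose index label is the string '1'
by_position = fruits.iloc[1]    # the item at position 1, i.e. the second item

print(by_label)
print(by_position)
```

Writing `.loc` or `.iloc` explicitly makes it clear to the reader which kind of lookup is intended.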

You can change the index to just about anything. It only depends on how you want to refer to each record.

In [None]:
s_new = pd.Series(['Apple', 'Banana', 'Guava', 'Tomato', 'Potato'], index=['Fruit1', 'Fruit2', 'Fruit3', 'Veg1', 'Veg2'])
print(s_new)

<b>You can use the index to select specific items from the Series ...</b>

In [None]:
print(s_new['Veg1'])
#4 is not an index label in s_new, so use .iloc to get the 5th element by position
print(s_new.iloc[4])

<b> You can use multiple indices </b>

In [None]:
#There are 2 square brackets here because you are sending in a list [a,b,c] to a series 
#So it's the equivalent of s_new[list]
s_new[['Fruit1', 'Fruit2', 'Fruit3']]

The `Series` constructor can convert a dictionary as well, using the keys of the dictionary as its index.

In [None]:
dictionary = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(dictionary)
cities
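
Notice that a `None` value in the dictionary shows up as `NaN` (a missing value) in the Series. A small sketch of how you might detect and skip missing values follows. The Series `visits` and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one city has no value
visits = pd.Series({'Austin': 450, 'Boston': None})

print(visits)                   # Boston is shown as NaN

missing = visits.isnull()       # True where the value is missing
print(missing)

present = visits.dropna()       # a new Series without the missing rows
print(present)
```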

<b> You can use boolean indexing for selection. </b>

In [None]:
cities[cities < 1000]
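
To see why this works, it can help to look at the comparison on its own. The sketch below uses a made-up Series, `cities_demo`, and splits the selection into two steps.

```python
import pandas as pd

# Hypothetical values for illustration
cities_demo = pd.Series({'Chicago': 1000, 'Portland': 900, 'Austin': 450})

mask = cities_demo < 1000       # a Series of True/False, one entry per city
print(mask)

small = cities_demo[mask]       # keeps only the entries where mask is True
print(small)
```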

You can also change the values in a Series on the fly.

In [None]:
# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])

What if you aren't sure whether an item is in the Series? You can check using the following statement.

In [None]:
print('Seattle' in cities)
print('San Francisco' in cities)

#You can also store in a variable like this
is_seattle_in_cities = 'Seattle' in cities
print(is_seattle_in_cities)
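
Keep in mind that `in` looks at the index labels of a Series, not at its values. A short sketch with a made-up Series, `stock`, shows the difference.

```python
import pandas as pd

# Hypothetical Series for illustration
stock = pd.Series({'Apple': 3, 'Banana': 5})

label_found = 'Apple' in stock       # 'Apple' is an index label
value_as_label = 5 in stock          # 5 is a value, not an index label
value_found = 5 in stock.values      # search the values explicitly

print(label_found, value_as_label, value_found)
```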

## PIT STOP

Here, let us revise what we've gone through with Series 

In [None]:
#Create a new dictionary of name:height pairs for yourself and the person on your left or right
#E.g.  {"Shubham": 183, "Gabriel": 144}

In [None]:
#Now convert this dictionary into a pandas series

In [None]:
#Print the name of the person who is taller

In [None]:
#Print the height of the people who are below 150cm

In [None]:
#Check if "Kim" is in your series

## DataFrames

<img src="images/dataFrame.jpg"/>

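A `DataFrame` (table) is made up of a few components:

* index - the row labels. Think of it as a column that contains the id for each row. In this data set, there is no explicit index, so pandas numbers the rows 0, 1, 2, ...
* columns - the named columns of the table. Each column is a `Series`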


In [None]:
#DataFrame({col1: {row1: value11, row2: value12},
#           col2: {row1: value21, row2: value22}})
df = pd.DataFrame({
        'A': {0: 'a', 1: 'b', 2: 'c'},
        'B': {0: 1, 1: 3, 2: 5},
        'C': {0: 2, 1: 4, 2: 6}})
df
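
If you do not need custom row labels, one common alternative is to pass a dictionary of lists. This sketch builds the same table that way. The name `df_demo` is only for illustration.

```python
import pandas as pd

# Each key is a column name; each list holds that column's values from top to bottom
df_demo = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': [1, 3, 5],
    'C': [2, 4, 6],
})

print(df_demo)
print(df_demo.index.tolist())     # the default row labels
print(df_demo.columns.tolist())   # the column names
```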

But don't worry! You will rarely have to create a DataFrame from scratch.
You will generally load it from a CSV or an Excel file.
So let's learn how to do that.

In [None]:
# You will be using this data in your exercise later on...
sal_df = pd.read_csv("data/Salaries.csv")
sal_df

In [None]:
#To read from an Excel file
df = pd.read_excel('data/enrollment.xlsx')
df

In [None]:
#Get the number of rows and columns in the DataFrame
df.shape

#To get number of rows,
df.shape[0]

## FUNCTIONS

The first thing to do after loading a CSV/Excel file is to get a feel for the data you are dealing with.
We could print the data, but we don't need all the rows.
So we'll use a function called `.head()`.

In [None]:
#df.head() prints the top 5 rows of data
df.head()

In [None]:
#You can also specify the number of rows that you want!
df.head(3)

After that, we want to see as much information as possible about the data types and the number of records.

For this, we will use `df.info()`

In [None]:
df.info()

The output shows that one column contains a null value.

A record with a missing value can spoil our analysis, so we want to drop it.<br/>
To do that, we will use a function called `df.dropna()`.

In [None]:
df.dropna(inplace=True)

#inplace=True means change the current object and don't create a new one
df.info()
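
Without `inplace=True`, `dropna()` leaves the original untouched and returns a new DataFrame instead. Here is a minimal sketch on made-up data. The names `raw` and `cleaned` and the numbers are only for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value
raw = pd.DataFrame({
    'year': [2015, 2016, 2017],
    'no_of_students': [120, None, 95],
})

cleaned = raw.dropna()      # new DataFrame without the incomplete row

print(len(raw))             # raw still has every row
print(len(cleaned))
```

Keeping the original around can be handy if you later want to check which rows were removed.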

Now we will look at summary statistics for the numerical columns in the DataFrame.<br/>
Use `df.describe()`

In [None]:
df.describe()

You can also call statistics functions on individual numerical columns to get specific values.

In [None]:
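mean = df['no_of_students'].mean()
std_dev = df['no_of_students'].std()
#Avoid naming variables min or max, because that hides Python's built-in functions
min_val = df['no_of_students'].min()
max_val = df['no_of_students'].max()
#0.25 is the 25th percentile; 0.5 would be the median
p25 = df['no_of_students'].quantile(0.25)

print(mean, std_dev, min_val, max_val, p25)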
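
`quantile(q)` takes a fraction between 0 and 1, so 0.25 is the 25th percentile and 0.5 is the median. A small sketch on a made-up Series, `scores`, where the answers are easy to check by eye:

```python
import pandas as pd

# Hypothetical values, already sorted to make the quantiles easy to read off
scores = pd.Series([10, 20, 30, 40, 50])

p25 = scores.quantile(0.25)
median = scores.quantile(0.5)
p75 = scores.quantile(0.75)

print(p25, median, p75)
```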


## Common Operations on DataFrames

**Accessing a column**

In [None]:
type(df['year'])

In [None]:
# table_variable['column_name']

df['year'].head(5)


In [None]:
#accessing multiple columns
df[['year', 'no_of_students']].head()

In [None]:
#Get all unique course_types
print(df['course_type'].unique())

<b>Boolean Indexing in DataFrames</b>

In [None]:
#Get all records where course name is Diploma in Banking & Financial Services
df[df['course_name']=='Diploma in Banking & Financial Services']
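
You can also combine several conditions. The sketch below uses a made-up DataFrame, `enrol`. Its column names follow the enrollment data, but the rows and the `'Part-time'` label are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration
enrol = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016],
    'course_type': ['Full-time', 'Part-time', 'Full-time', 'Part-time'],
    'no_of_students': [100, 40, 120, 35],
})

# & means "and", | means "or"; each condition needs its own parentheses
full_time_2016 = enrol[(enrol['course_type'] == 'Full-time') & (enrol['year'] == 2016)]
print(full_time_2016)
```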

Exercise: Get the number of students in part-time and full-time courses


In [None]:
full_time_students = df[df['course_type']=='Full-time']
full_time_students.head()

In [None]:
#Get no_of_students column
full_time_students['no_of_students']

In [None]:
#get Sum
full_time_students['no_of_students'].sum()

In [None]:
full_time_sum = df[df['course_type']=='Full-time']['no_of_students'].sum()
part_time_sum = df[df['course_type']=='Part-time diploma']['no_of_students'].sum()

print(full_time_sum, part_time_sum)
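
When there are many categories, filtering once per category gets repetitive. One possible shortcut is `groupby`, sketched here on made-up data. The DataFrame `enrol`, its numbers, and the `'Part-time'` label are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration
enrol = pd.DataFrame({
    'course_type': ['Full-time', 'Part-time', 'Full-time', 'Part-time'],
    'no_of_students': [100, 40, 120, 35],
})

# One total per distinct course_type, computed in a single step
totals = enrol.groupby('course_type')['no_of_students'].sum()
print(totals)
```

The result is a Series indexed by course type, so `totals['Full-time']` gives the full-time total.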

In [None]:
#Exercise for you:Find number of students in School Of Engineering

**Modifying a Column**

In this example, we are setting the `no_of_students` to a fixed value of 5

In [None]:
new_df= df.copy()

#.copy() creates a new DataFrame so that we don't keep editing the original one

new_df['no_of_students'] = 5

new_df.head(5)


We can also compute a column from existing columns. This is like Excel, where you might have a formula for cell `C1` such as

```
= A1 + B1
```

In pandas you would have

```
df['C'] = df['A'] + df['B']
```

Where the formula applies to the entire column
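
A runnable version of that idea, as a minimal sketch on a made-up DataFrame called `sheet`:

```python
import pandas as pd

# Hypothetical two-column table for illustration
sheet = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# The addition is applied row by row to the whole column at once
sheet['C'] = sheet['A'] + sheet['B']
print(sheet)
```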

In [None]:
new_df = df.copy()

# In this example, you are adding 1 to the existing no_of_students column

print(df.head()[['course_name', 'no_of_students']])

new_df['no_of_students'] = new_df['no_of_students'] + 1
print("\n")

print(new_df.head()[['course_name', 'no_of_students']])

<b> Sorting values </b>

In [None]:
#Sort df according to no_of_students
new_df = df.copy()
new_df.sort_values(['no_of_students'], ascending=False, inplace=True)
new_df.head()
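
If you leave out `inplace=True`, `sort_values` returns a sorted copy and the original order is kept. A small sketch on made-up data follows. The course names and numbers are only for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration
enrol = pd.DataFrame({
    'course_name': ['X', 'Y', 'Z'],
    'no_of_students': [40, 120, 75],
})

ranked = enrol.sort_values('no_of_students', ascending=False)

print(ranked)    # largest course first
print(enrol)     # original order is unchanged
```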

<b>Let's play around with sal_df now</b>

**What is the name of the lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?**

In [None]:
lowest = sal_df['TotalPayBenefits'].min()
lowest

In [None]:
sal_df[sal_df['TotalPayBenefits'] == lowest]

In [None]:
sal_df[sal_df['TotalPayBenefits'] == lowest]['EmployeeName']
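
Another way to reach the same row is `idxmin()`, which returns the index label of the smallest value. This is a sketch on made-up data. The names and amounts below are not taken from the salaries file.

```python
import pandas as pd

# Hypothetical employees and amounts for illustration
pay = pd.DataFrame({
    'EmployeeName': ['Ana', 'Ben', 'Cy'],
    'TotalPayBenefits': [52000.0, 310.5, 71000.0],
})

lowest_label = pay['TotalPayBenefits'].idxmin()   # index label of the smallest value
lowest_row = pay.loc[lowest_label]                # the full row for that label

print(lowest_row['EmployeeName'])
```

If several rows tie for the minimum, `idxmin()` returns only the first one, whereas the boolean filter above returns all of them.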

In [None]:
len(sal_df['JobTitle'].unique())

In [None]:
sal_df.head()

# Saving a DataFrame

## Saving to a csv file

Note: Do not forget to add the ".csv" extension to the file name.

In [None]:
df.to_csv('data/updated_csv.csv', encoding='utf-8')
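
By default, `to_csv` also writes the index as an extra first column. If you do not want that, you can pass `index=False`. The sketch below writes to an in-memory buffer instead of a real file so the result can be printed. The DataFrame `table` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical table for illustration
table = pd.DataFrame({'year': [2015, 2016], 'no_of_students': [100, 120]})

buffer = io.StringIO()               # stands in for a file on disk
table.to_csv(buffer, index=False)    # index=False leaves out the row labels
csv_text = buffer.getvalue()

print(csv_text)
```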

## Saving to an Excel Workbook

In [None]:
#The with block saves and closes the workbook when it finishes
with pd.ExcelWriter('data/updated_xlsx.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')

## Saving to a Python dictionary

In [None]:
dictionary = df.to_dict()
dictionary
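
The default layout is nested by column. If one dictionary per row is easier to work with, `to_dict` accepts an `orient` argument. Here is a sketch on a made-up DataFrame, `table`:

```python
import pandas as pd

# Hypothetical table for illustration
table = pd.DataFrame({'name': ['Ana', 'Ben'], 'height': [183, 144]})

by_column = table.to_dict()                  # {column: {row label: value}}
by_row = table.to_dict(orient='records')     # a list with one dictionary per row

print(by_column)
print(by_row)
```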

In [None]:
sal_df['Year'].unique()

<b>Fun exercise! Let's plot a simple graph</b>

In [None]:
import matplotlib.pyplot as plt
sal_df_copy = sal_df.copy()

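#Sum only the two pay columns for each year, so text columns are not summed
sal_df_copy_grouped = sal_df_copy.groupby('Year')[['TotalPay', 'TotalPayBenefits']].sum().reset_index()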
sal_df_copy_grouped.plot(kind='line', x='Year', y=['TotalPay', 'TotalPayBenefits'], xticks=sal_df_copy_grouped['Year'])


### Additional things you can learn on your own

1. Pivot (df.pivot)
2. Plotting charts (matplotlib library)
3. Group by