# Introduction

## What is pandas?

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.

It is also one of the most popular libraries used by data experts from all around the world.

## What can you do with pandas?

Pandas is used for data wrangling, data analysis and data visualisation.

Examples include creating and merging dataframes, dropping unwanted columns and rows, locating and filling null values, grouping data by category, and creating basic plots such as bar plots, scatter plots and histograms.
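
To make two of these tasks concrete, here is a minimal sketch of filling null values and grouping data by category. The `sales` dataframe and its column names are made up for illustration; they are not part of this tutorial's datasets.

```python
import pandas as pd

# Hypothetical data: one 'amount' value is missing (None becomes NaN)
sales = pd.DataFrame({'category': ['fruit', 'veg', 'fruit', 'veg'],
                      'amount': [10.0, None, 30.0, 20.0]})

# Locate null values: count how many amounts are missing
n_missing = sales['amount'].isna().sum()
print(n_missing)  # 1

# Fill the missing amount with 0
sales['amount'] = sales['amount'].fillna(0.0)

# Group by category and add up the amounts in each group
totals = sales.groupby('category')['amount'].sum()
print(totals)
```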

## Why should you learn to use pandas?

As humans interact more and more with technology, vast amounts of data are being generated each day. Hence, the ability to analyse these data and draw insights from them is becoming an increasingly important skill to have in the modern workforce. Organisations are progressively turning to data to help them better understand their customers and products, analyse past trends and patterns, improve operational efficiency and so on.

Here are just some of the many reasons why you should learn pandas:
- By learning pandas, you learn the fundamental ideas behind working with data as well as some of the skills and knowledge needed to code in Python
- It is straightforward to learn and you can immediately apply it to any dataset you want
- It is commonly used in the data science and machine learning community

## Where can you find pandas?

The best way to get access to pandas is by installing [Anaconda](https://docs.anaconda.com/anaconda/install/), which is a distribution of the Python and R programming languages, both of which are heavily used in data science.

By installing Anaconda, you will also have access to Jupyter notebook, which is what I am using to write up this documentation. Jupyter notebook allows you to easily run your Python code cell by cell.

## What do I hope to do with this video series?

This video series is going to be a complete beginner's course on how to use pandas. I don't expect you to have any prior knowledge or background in data science or even programming in general.

Through this video series, I aim to pass on what I have learned about pandas thus far and furthermore inspire people to incorporate pandas into their future data analysis work whether that is for their university assignment, side projects or professional work.

On your end, the best way to gain value out of this video series is by doing. Programming is just like driving - you don't learn how to drive merely by reading about it or watching a video of someone else doing it; you have to actually do it yourself. So I highly encourage you to install Jupyter notebook on your computer and have a go at using pandas yourself after you finish watching my weekly content.

# Part 1: Reading csv files & creating your own dataframe

To use pandas, we first have to import the pandas library, which is done as follows:

In [1]:
# Import pandas and label it as 'pd'

import pandas as pd

## Reading csv files

For this part of the tutorial, you will need to download the [titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle. Once you have downloaded the file, unzip it, i.e. extract its contents. Keep in mind where the files are on your computer, because we need to specify their location in Jupyter notebook in order to load the data.

In [2]:
# Read data via 'pd.read_csv'
# Use the appropriate read function for each file format, for example pd.read_excel imports files in Excel format

train = pd.read_csv("titanic/train.csv")
test = pd.read_csv("titanic/test.csv")
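
If you do not have the Titanic files to hand yet, the sketch below shows the same `pd.read_csv` call on a small piece of CSV text held in memory. The CSV content and the names `csv_text`, `passengers` and `passengers_by_id` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk (not the Titanic data)
csv_text = """PassengerId,Name,Age
1,Alice,29
2,Bob,
3,Carol,41
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
passengers = pd.read_csv(io.StringIO(csv_text))

# index_col uses a column as the row labels instead of the default 0, 1, 2, ...
passengers_by_id = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

print(passengers)
print(passengers_by_id)
```

Note that Bob's empty `Age` field is read in as a missing value (`NaN`).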

Let's have a look at our datasets.

In [8]:
# 'head' shows the first five rows of the dataframe by default but you can specify the number of rows in the parentheses

train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
# 'tail' shows the bottom five rows by default

test.tail()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
417,1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C
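
Both `head` and `tail` accept the number of rows to show. A minimal sketch, using a made-up ten-row dataframe called `numbers` rather than the Titanic data:

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for illustration
numbers = pd.DataFrame({'value': range(10, 20)})

first_three = numbers.head(3)  # rows with index 0, 1, 2
last_two = numbers.tail(2)     # rows with index 8, 9

print(first_three)
print(last_two)
```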


In [4]:
# 'shape' is an attribute (no parentheses) that tells us how many rows and columns exist in a dataframe

train.shape

(891, 12)
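
Because `shape` is a `(rows, columns)` tuple, it can be unpacked into two variables. A small sketch with a made-up `scores` dataframe:

```python
import pandas as pd

# Hypothetical dataframe with 3 rows and 2 columns
scores = pd.DataFrame({'Science': [50, 75, 31], 'Math': [72, 86, 94]})

# shape is an attribute, so there are no parentheses after it
n_rows, n_cols = scores.shape
print(n_rows, n_cols)  # 3 2

# len() of a dataframe gives the number of rows only
print(len(scores))  # 3
```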

## Creating your own dataframe

In [13]:
# Number entries

test_scores = pd.DataFrame({'Student_ID': [154, 973, 645], 'Science': [50, 75, 31], 'Geography': [88, 100, 66],
                            'Math': [72, 86, 94]})
test_scores

,Student_ID,Science,Geography,Math
0,154,50,88,72
1,973,75,100,86
2,645,31,66,94


In [12]:
# Text entries

survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 'Emily': ['It is too sweet', 'Yum!']})
survey

,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!


## Index

We can either set an existing column as our index or specify an index when creating a dataframe.

Let's begin by setting an existing column as the index.

In [14]:
test_scores = test_scores.set_index('Student_ID')
test_scores

Student_ID,Science,Geography,Math
154,50,88,72
973,75,100,86
645,31,66,94
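
Note that `set_index` returns a new dataframe, which is why the result is assigned back to `test_scores` above. A minimal sketch of the difference, using a trimmed copy of the data and a separate variable name (`indexed`) chosen for illustration:

```python
import pandas as pd

# Same idea as the test_scores dataframe above, trimmed to two columns
test_scores = pd.DataFrame({'Student_ID': [154, 973, 645], 'Science': [50, 75, 31]})

# The result is stored under a new name, so test_scores itself is unchanged
indexed = test_scores.set_index('Student_ID')

print(test_scores.columns.tolist())  # ['Student_ID', 'Science']
print(indexed.columns.tolist())      # ['Science']
print(indexed.index.tolist())        # [154, 973, 645]
```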


Alternatively, we can specify an index column when creating a dataframe via the 'index' argument.

In [15]:
survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 'Emily': ['It is too sweet', 'Yum!']},
                      index=['Product A', 'Product B'])
survey

,James,Emily
Product A,I liked it,It is too sweet
Product B,It could use a bit more salt,Yum!


You can also reset the index back to its default.

In [16]:
# Reset index
# Try playing around with 'drop' and 'inplace' and see what they do
# inplace=True => if we want to commit the changes to the existing dataframe

survey.reset_index(drop=True, inplace=True)
survey

,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!
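
If you would like to see the effect of `drop` and `inplace` side by side, the sketch below rebuilds the survey dataframe and calls `reset_index` without `inplace=True`. The variable names `kept` and `dropped` are just for illustration.

```python
import pandas as pd

survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'],
                       'Emily': ['It is too sweet', 'Yum!']},
                      index=['Product A', 'Product B'])

# drop=False (the default) keeps the old index as a new column named 'index'
kept = survey.reset_index()
print(kept.columns.tolist())     # ['index', 'James', 'Emily']

# drop=True discards the old index entirely
dropped = survey.reset_index(drop=True)
print(dropped.columns.tolist())  # ['James', 'Emily']

# Without inplace=True, survey itself still has its original index
print(survey.index.tolist())     # ['Product A', 'Product B']
```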


## Renaming columns 

In [17]:
# Suppose we want to change the names of the first two columns

test_scores.rename(columns={'Geography': 'Physics', 'Science': 'Arts'}, inplace=True)
test_scores

Student_ID,Arts,Physics,Math
154,50,88,72
973,75,100,86
645,31,66,94
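
Two details of `rename` are worth seeing in isolation: columns that are not in the mapping keep their names, and without `inplace=True` the original dataframe is left untouched. A minimal sketch, using a trimmed copy of the data and a new variable `renamed`:

```python
import pandas as pd

test_scores = pd.DataFrame({'Science': [50, 75], 'Geography': [88, 100], 'Math': [72, 86]})

# Names missing from the mapping (here 'Math') are left unchanged
renamed = test_scores.rename(columns={'Geography': 'Physics', 'Science': 'Arts'})
print(renamed.columns.tolist())      # ['Arts', 'Physics', 'Math']

# Without inplace=True the original dataframe keeps its column names
print(test_scores.columns.tolist())  # ['Science', 'Geography', 'Math']
```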


## Dropping columns and rows

There are a few ways you can drop columns or rows from your dataframe. In this example, I am only focusing on the 'drop' function.

In [18]:
# Drop the 'Math' column

test_scores.drop(columns='Math')

Student_ID,Arts,Physics
154,50,88
973,75,100
645,31,66


In [19]:
# Drop row with student_ID 973
# We can make this more robust once we learn the 'loc' function in the coming weeks 

test_scores.drop(973)

Student_ID,Arts,Physics,Math
154,50,88,72
645,31,66,94
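
The two cells above do not use `inplace=True` and do not assign the result, so `test_scores` itself keeps all of its rows and columns. The sketch below shows this, along with dropping several columns or rows at once by passing a list. The variable names `arts_only` and `two_removed` are just for illustration.

```python
import pandas as pd

test_scores = pd.DataFrame({'Arts': [50, 75, 31], 'Physics': [88, 100, 66], 'Math': [72, 86, 94]},
                           index=pd.Index([154, 973, 645], name='Student_ID'))

# Drop several columns at once by passing a list
arts_only = test_scores.drop(columns=['Physics', 'Math'])
print(arts_only.columns.tolist())  # ['Arts']

# Drop several rows by index label; index= makes the intent explicit
two_removed = test_scores.drop(index=[154, 645])
print(two_removed.index.tolist())  # [973]

# drop returned new dataframes, so the original is unchanged
print(test_scores.shape)           # (3, 3)
```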


## Adding columns and rows

In [19]:
test_scores

Student_ID,Arts,Physics,Math
154,50,88,72
973,75,100,86
645,31,66,94


In [20]:
# Create a new column for history subject

test_scores['History'] = [79, 70, 67]
test_scores

Student_ID,Arts,Physics,Math,History
154,50,88,72,79
973,75,100,86,70
645,31,66,94,67
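
A new column does not have to be typed out by hand. As a sketch, it can also be computed from existing columns or filled with a single value; the 'Total' and 'Year' columns below are made up for illustration.

```python
import pandas as pd

test_scores = pd.DataFrame({'Arts': [50, 75, 31], 'Physics': [88, 100, 66]},
                           index=pd.Index([154, 973, 645], name='Student_ID'))

# Columns are added together row by row
test_scores['Total'] = test_scores['Arts'] + test_scores['Physics']
print(test_scores['Total'].tolist())  # [138, 175, 97]

# A single value is repeated for every row
test_scores['Year'] = 2020
print(test_scores['Year'].tolist())   # [2020, 2020, 2020]
```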


In [22]:
# Add more product reviews from James and Emily
# Recall our survey dataframe

survey

,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!


In [23]:
# Create two more rows

df = pd.DataFrame({'James': ['Not good', 'Meh'], 'Emily': ['My grandma can cook better', 'Pretty average']})
df

,James,Emily
0,Not good,My grandma can cook better
1,Meh,Pretty average


In [24]:
# Use 'pd.concat' to append the new rows ('DataFrame.append' was removed in pandas 2.0)

survey = pd.concat([survey, df], ignore_index=True)
survey

,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!
2,Not good,My grandma can cook better
3,Meh,Pretty average
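
To see what `ignore_index=True` does when rows are stacked, here is a minimal sketch using `pd.concat` on two one-row dataframes. The names `first`, `second`, `kept` and `renumbered` are just for illustration.

```python
import pandas as pd

first = pd.DataFrame({'James': ['I liked it'], 'Emily': ['Yum!']})
second = pd.DataFrame({'James': ['Meh'], 'Emily': ['Pretty average']})

# By default each dataframe keeps its own row labels, so 0 appears twice
kept = pd.concat([first, second])
print(kept.index.tolist())        # [0, 0]

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1]
```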


## Series

There are two core objects in pandas: the dataframe, which we have already gone through, and the series.

A dataframe, as we have seen, looks like a data table. A series, on the other hand, is a one-dimensional sequence of data values, similar to a list but with an index label attached to each value.

In [27]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

You can think of a series as being a single column within a dataframe, and so we can assign an index label to a series just like we would with a dataframe.

In [28]:
profit = pd.Series([75, 80, 66], index=['2018 Profit', '2019 Profit', '2020 Profit'])
profit

2018 Profit    75
2019 Profit    80
2020 Profit    66
dtype: int64
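
The sketch below illustrates both points: a value in a series can be looked up by its index label, and selecting a single column from a dataframe gives back a series. The small `test_scores` dataframe is rebuilt here so the example runs on its own.

```python
import pandas as pd

profit = pd.Series([75, 80, 66], index=['2018 Profit', '2019 Profit', '2020 Profit'])

# Look up a value by its index label
print(profit['2019 Profit'])   # 80

# Selecting one column of a dataframe returns a series
test_scores = pd.DataFrame({'Science': [50, 75, 31], 'Math': [72, 86, 94]})
science = test_scores['Science']
print(type(science).__name__)  # Series
print(science.name)            # Science
```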

Using this same logic, we can form a dataframe using a list of lists, where each inner list is a sequence of values much like a series. Let's see how we can do that.

In [31]:
customer_sales = pd.DataFrame([[317, 'Melbourne', '80'], [887, 'New York', '91'], [225, 'London', '50']],
                              columns=['Customer_ID', 'City', 'Sales'])
customer_sales

,Customer_ID,City,Sales
0,317,Melbourne,80
1,887,New York,91
2,225,London,50


Unlike before, when we were creating our dataframe column by column, when creating a dataframe from a list of lists, each inner list corresponds to a single row in the dataframe.
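
As a quick check of the difference, the sketch below builds the same small table both ways, once row by row and once column by column, and compares the results. The names `by_row` and `by_column` are just for illustration, and the 'Sales' column is left out to keep it short.

```python
import pandas as pd

# Row by row: each inner list is one row, and column names are given separately
by_row = pd.DataFrame([[317, 'Melbourne'], [887, 'New York'], [225, 'London']],
                      columns=['Customer_ID', 'City'])

# Column by column: each dictionary key is a column name
by_column = pd.DataFrame({'Customer_ID': [317, 887, 225],
                          'City': ['Melbourne', 'New York', 'London']})

# Both approaches produce the same dataframe
print(by_row.equals(by_column))  # True
```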