# Introduction

## What is pandas?

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.

It is also one of the most popular libraries used by data experts from all around the world.

## What can you do with pandas?

Pandas is used for data wrangling, data analysis and data visualisation.

Examples include creating and merging dataframes, dropping unwanted columns and rows, locating and filling null values, grouping data by category, and creating basic plots such as bar plots, scatter plots and histograms.
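
To give a flavour of a few of these tasks, here is a minimal sketch. The `orders` and `cities` dataframes and all of their values are made up purely for illustration; they are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
orders = pd.DataFrame({'customer': ['Ann', 'Ben', 'Ann', 'Cat'],
                       'amount': [20.0, None, 35.0, 10.0]})
cities = pd.DataFrame({'customer': ['Ann', 'Ben', 'Cat'],
                       'city': ['Perth', 'Sydney', 'Perth']})

# Merge the two dataframes on their shared 'customer' column
merged = orders.merge(cities, on='customer')

# Fill the missing amount with 0
merged['amount'] = merged['amount'].fillna(0)

# Group by city and total the amounts
totals = merged.groupby('city')['amount'].sum()
print(totals)
```

Don't worry if this looks unfamiliar; each of these steps is introduced slowly later on.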

## Why should you learn to use pandas?

As humans interact more and more with technology, vast amounts of data are being generated each day. Hence, the ability to analyse these data and draw insights from them is becoming an increasingly important skill to have in the modern workforce. Organisations are progressively turning to data to help them better understand their customers and products, analyse past trends and patterns, improve operational efficiency and so on.

Here are just some of the many reasons why you should learn pandas:
- By learning pandas, you learn the fundamental ideas behind working with data, along with some Python coding skills
- It is straightforward to learn and you can immediately apply it to any dataset you want
- It is commonly used in the data science and machine learning community

## Where can you find pandas?

The best way to get access to pandas is by installing [Anaconda](https://docs.anaconda.com/anaconda/install/), a distribution of the Python and R programming languages, both of which are heavily used in data science.

By installing Anaconda, you will also have access to Jupyter Notebook, which is what I am using to write this documentation. Jupyter Notebook lets you run your Python code cell by cell.

## What do I hope to do with this video series?

This video series is going to be a complete beginner's course on how to use pandas. I don't expect you to have any prior knowledge or background in data science, or even in programming in general.

Through this video series, I aim to pass on what I have learned about pandas so far and to inspire people to incorporate pandas into their future data analysis work, whether that is for university assignments, side projects or professional work.

On your end, the best way to gain value from this video series is by doing. Programming is just like driving: you don't learn how to drive merely by reading about it or watching a video of someone else doing it. You have to do it yourself. So I highly encourage you to install Jupyter Notebook on your computer and have a go at using pandas yourself after you finish watching my weekly content.

# Week 1: Reading csv files & creating your own dataframe

To use pandas, we first have to import the pandas library. The way to do that is as follows:

In [None]:
# Import pandas and give it the alias 'pd'
import pandas as pd

## Reading csv files

For this part of the tutorial, you will need to download the [titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle. Once you have downloaded the file, unzip it, i.e. extract its contents. Keep in mind where the file is on your computer, because we need to specify its location in Jupyter Notebook in order to load the data.

In [None]:
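# Three equivalent ways to write a Windows file path (adjust the path to your own computer)
# 1. A raw string, so backslashes are taken literally
#df = pd.read_csv(r"E:\Academic\Work\Array\Interimediate Python\pandas-tutorial-master\train.csv")
# 2. Escaped (doubled) backslashes
#df = pd.read_csv("E:\\Academic\\Work\\Array\\Interimediate Python\\pandas-tutorial-master\\train.csv")
# 3. Forward slashes
#df = pd.read_csv("E:/Academic/Work/Array/Interimediate Python/pandas-tutorial-master/train.csv")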
 



In [None]:
# Read data via 'pd.read_csv'
# Use the appropriate read function for each file format, for example pd.read_excel imports files in Excel format
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Other readers include:
# pd.read_json("")
# pd.read_excel("")
# pd.read_sql("")
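
If you don't have the file to hand yet, you can still try `pd.read_csv` on a small piece of text. The sketch below assumes a made-up two-row CSV held in memory via `io.StringIO`, which stands in for a real file on disk; the names `csv_text` and `people` are just for illustration.

```python
import io
import pandas as pd

# A tiny CSV written out as text; the first line holds the column names
csv_text = "name,age\nAlice,30\nBob,25\n"

# io.StringIO makes the text behave like an open file
people = pd.read_csv(io.StringIO(csv_text))
print(people)
```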

Let's have a look at our datasets

In [None]:
train

In [None]:
train1 = train.copy()
train1

In [None]:
# set_index returns a new dataframe and leaves 'train' unchanged.
# To keep the change, write train = train.set_index("PassengerId") or pass inplace=True.
train.set_index("PassengerId")
train
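
The difference between calling `set_index` on its own and assigning its result can be seen on a small example. This is only a sketch; the `passengers` dataframe and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature passenger table
passengers = pd.DataFrame({'PassengerId': [1, 2, 3],
                           'Name': ['Ann', 'Ben', 'Cat']})

# Without assignment, the original dataframe is left untouched
passengers.set_index('PassengerId')
print(passengers.columns.tolist())

# Assigning the result to a variable keeps the change
indexed = passengers.set_index('PassengerId')
print(indexed.index.name)
```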

In [None]:
# Drop the row whose index label is 887 (returns a new dataframe; 'train' is unchanged)
train.drop(index=887)

In [None]:
# Drop the 'Cabin' column (returns a new dataframe; 'train' is unchanged)
train.drop(columns="Cabin")

In [None]:
# Print every row and column of the dataframe as text, without truncation
print(train.to_string())

In [None]:
#pd.options.display.max_rows = 99999

In [None]:
#train

In [None]:
# 'head' shows the first five rows of the dataframe by default but you can specify the number of rows in the parenthesis
train.head()

In [None]:
# 'tail' shows the bottom five rows by default
train.tail()
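
As a quick illustration of passing a number of rows to `head` and `tail`, here is a sketch on a made-up ten-row dataframe called `numbers`.

```python
import pandas as pd

# Hypothetical dataframe holding the values 0 to 9
numbers = pd.DataFrame({'value': range(10)})

first_three = numbers.head(3)   # first three rows
last_two = numbers.tail(2)      # last two rows

print(first_three)
print(last_two)
```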

In [None]:
# 'shape' is an attribute (no parentheses) that tells us how many rows and columns a dataframe has
train.shape
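
Because `shape` is a tuple of (rows, columns), you can unpack it into two variables. A small sketch, using a made-up dataframe named `grid`:

```python
import pandas as pd

# Hypothetical dataframe with 3 rows and 2 columns
grid = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = grid.shape
print(n_rows, n_cols)
```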

In [None]:
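# 'info' summarises each column: its name, number of non-null values and data type
train.info()
# Common data types: object (typically text), float64 (decimal numbers), int64 (whole numbers)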

## Creating your own dataframe

In [None]:
# Number entries
mydict = {'Student_ID': [154, 973, 645], 
          'Science': [50, 75, 31], 
          'Geography': [88, 100, 66],
          'Math': [72, 86,94]}

test_scores = pd.DataFrame(mydict)
test_scores

In [None]:
# Text entries

survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'],
                       'Emily': ['It is too sweet', 'Yum!']})
survey

## Index

We can either set an existing column as our index or specify an index when creating a dataframe.

Let's begin by setting an existing column as the index.

In [None]:
test_scores = test_scores.set_index('Student_ID')
test_scores

Alternatively, we can specify an index column when creating a dataframe via the 'index' argument.

In [None]:
survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 
                       'Emily': ['It is too sweet', 'Yum!']},
                     index = ['Product A', 'Product B'])
survey

In [None]:
# axis = 0 --> the label refers to a row
# axis = 1 --> the label refers to a column

In [None]:
survey.drop("Product A", axis = 0)

In [None]:
survey.drop("Emily", axis=1)

In [None]:
survey

You can also reset the index back to its default.

In [None]:
# Reset index
# Try playing around with 'drop' and 'inplace' and see what they do
# survey = survey.reset_index(drop = True)
# or
survey.reset_index(drop = True, inplace = True)
survey
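
To see what the `drop` argument of `reset_index` does, compare the two results side by side. This is a sketch on a small made-up dataframe called `reviews`, so that `survey` itself is left alone.

```python
import pandas as pd

# Hypothetical dataframe with a custom index
reviews = pd.DataFrame({'James': ['I liked it', 'Too salty']},
                       index=['Product A', 'Product B'])

# Default (drop=False): the old index is kept as a new column named 'index'
kept = reviews.reset_index()

# drop=True: the old index is thrown away
discarded = reviews.reset_index(drop=True)

print(kept)
print(discarded)
```

Neither call uses `inplace = True`, so `reviews` keeps its original index throughout.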

In [None]:
survey

In [None]:
train.head()

In [None]:
train.reset_index(inplace = True)
train.head()

In [None]:
train.rename(columns = {'PassengerId' : 'Passenger ID', 'SibSp': 'Siblings' }, inplace = True)
train.head()

## Renaming columns 

In [None]:
test_scores

In [None]:
# Suppose we want to change the names of the first two columns
test_scores.rename(columns = {'Geography': 'Physics', 'Science': 'Arts'}, inplace= True)
test_scores

## Dropping columns and rows

There are a few ways you can drop columns or rows from your dataframe. In this example, I am only focusing on the 'drop' method.

In [None]:
# Drop the 'Math' column
test_scores.drop(columns ='Math')

In [None]:
test_scores

In [None]:
# Drop the row with Student_ID 973
# We can make this more robust once we learn the 'loc' function in the coming weeks 
test_scores.drop(index = 973)

In [None]:
# A label passed without a keyword is treated as a row label, so this drops student 645
x = test_scores.drop(645)
x
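
For completeness, here is a rough sketch of a couple of the other ways to remove columns or rows, alongside `drop` with several labels at once. The `scores` dataframe is a made-up stand-in so that `test_scores` is not modified.

```python
import pandas as pd

# Hypothetical scores table indexed by student ID
scores = pd.DataFrame({'Science': [50, 75, 31], 'Math': [72, 86, 94]},
                      index=[154, 973, 645])

# Keep only the columns you want instead of dropping the rest
science_only = scores[['Science']]

# Keep only the rows that satisfy a condition
without_973 = scores[scores.index != 973]

# 'drop' also accepts lists, and rows and columns can be dropped together
trimmed = scores.drop(index=[154, 645], columns=['Math'])

print(trimmed)
```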

## Adding columns and rows

In [None]:
test_scores

In [None]:
# Create a new column for history subject
test_scores['History'] = [79, 70, 67]
test_scores

In [None]:
train["family relations"] = train["Siblings"] + train["Parch"]
train.head()
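
The cells above only add columns. One possible way to add rows, sketched here on a made-up `scores` dataframe, is to assign to a new index label with `loc`, or to stack two dataframes with `pd.concat`.

```python
import pandas as pd

# Hypothetical scores table indexed by student ID
scores = pd.DataFrame({'Science': [50, 75], 'Math': [72, 86]},
                      index=[154, 973])

# Add a single row by assigning to an index label that does not exist yet
scores.loc[645] = [31, 94]

# Add one or more rows by concatenating another dataframe with the same columns
new_rows = pd.DataFrame({'Science': [60], 'Math': [81]}, index=[802])
combined = pd.concat([scores, new_rows])

print(combined)
```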

## Series

There are two core objects in pandas. One is the dataframe, which we have already gone through; the other is called a series.

A dataframe, as we have seen, looks like a data table. A series, on the other hand, is a one-dimensional sequence of data values, much like a list with an index attached.

In [None]:
pd.Series([1, 2, 3, 4, 5])

You can think of a series as a single column within a dataframe, so we can assign index labels to a series just as we would with a dataframe.

In [None]:
profit = pd.Series([75, 80, 66], index = ['2018 Profit', '2019 Profit', '2020 Profit'])
profit
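
You can check the "single column" idea for yourself: selecting one column from a dataframe gives you back a series that shares the dataframe's index. A small sketch, using a made-up `scores` dataframe:

```python
import pandas as pd

# Hypothetical scores table indexed by student ID
scores = pd.DataFrame({'Science': [50, 75, 31], 'Math': [72, 86, 94]},
                      index=[154, 973, 645])

# Selecting a single column returns a Series
science = scores['Science']
print(type(science))

# The series carries the same index labels as the dataframe
print(science.loc[973])
```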

Using the same logic, we can form a dataframe from a list of lists, where each inner list is a sequence of values. Let's see how we can do that.

In [None]:
#3x3 matrix
[[1,2,3],
[4,5,6],
[7,8,9]]

In [None]:
customer_sales = pd.DataFrame([[317.1, 'Melbourne', '80'], 
                               [887, 'New York', '91'], 
                               [225, 'London', '50']], 
                              columns = ['Customer_ID', 'City', 'Sales'])
customer_sales

Unlike before, when we created our dataframe column by column from a dictionary, here each inner list corresponds to a single row in the dataframe.
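
To make the contrast concrete, the sketch below builds the same small table both ways. The variable names and values are made up for illustration.

```python
import pandas as pd

# Column by column: each dictionary key is a column
by_column = pd.DataFrame({'City': ['Melbourne', 'London'],
                          'Sales': [80, 50]})

# Row by row: each inner list is one row, and column names are passed separately
by_row = pd.DataFrame([['Melbourne', 80],
                       ['London', 50]],
                      columns=['City', 'Sales'])

# Check whether both approaches produce the same dataframe
print(by_column.equals(by_row))
```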