<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/pandas/Pandas_Introduction_Video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Introduction

> **Video Introduction**
>
> [Here is a video](https://www.youtube.com/watch?v=3QSzdL6ikLI) that goes with this Pandas introduction.

> **What to Read Next**
>
> The follow-up to this notebook shows [how to search pandas data](https://github.com/werowe/HypatiaAcademy/blob/master/pandas/pandas_search.ipynb).

Here we show how to work with Python Pandas. Pandas is one of the easiest ways to work with many kinds of data because it arranges the data into rows and columns, similar to a spreadsheet. Pandas is also commonly used to prepare data for machine learning models.

Below we show how to:

* Read data into a Pandas dataframe from a CSV file.
* Print the column names.
* See and set the index.
* Print sections of the dataframe.
* Compute basic statistics.
* Create a dataframe from a dictionary.

In [16]:
# Here we read data from a CSV file into a pandas dataframe. The result is data arranged in rows and columns.

import pandas as pd
import numpy as np

df=pd.read_csv("https://raw.githubusercontent.com/werowe/HypatiaAcademy/master/pandas/pandasdata.csv")
df



,StudentName,Grade,Major,Advisor,StudentNumber
0,Fred,100,math,Mr Watts,123
1,Sam,92,English,Ms Smith,456
2,Sally,84,computers,Mr Watts,789
3,Susan,93,math,Ms Smith,101
4,Arthur,74,biology,Mr Watts,102


In [17]:
# Pandas got its column names from the top line of the .csv file. If the CSV file has no header row, you can
# supply the column names manually.

df.columns

Index(['StudentName', 'Grade', 'Major', 'Advisor', 'StudentNumber'], dtype='object')
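As a minimal sketch of supplying column names manually, the example below reads CSV text that has no header row. The `raw` string and the `no_header` variable are hypothetical names made up for illustration; the two data rows are copied from the table above, and `io.StringIO` stands in for a file so the example runs without a download.

```python
import io

import pandas as pd

# Hypothetical header-less CSV text, reusing two rows from the table above.
raw = "Fred,100,math,Mr Watts,123\nSam,92,English,Ms Smith,456\n"

# header=None tells pandas that the first line is data, not column names.
# names= supplies the column names manually.
no_header = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["StudentName", "Grade", "Major", "Advisor", "StudentNumber"],
)

print(no_header.columns)
```

Without `header=None`, pandas would treat the first data row (Fred's row) as the column names.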

In [18]:
df['StudentName']

0      Fred
1       Sam
2    Sally 
3     Susan
4    Arthur
Name: StudentName, dtype: object

In [19]:
# When you don't supply an index, pandas uses the numbers 0, 1, 2, 3, ...  That is called a RangeIndex.
df.index

RangeIndex(start=0, stop=5, step=1)

In [20]:
# If you have some kind of unique value, in this case StudentNumber, then you can use that as the index.
# That makes it easier to organize and look up data.

df.set_index(df['StudentNumber'], inplace=True)

In [21]:
df

StudentNumber,StudentName,Grade,Major,Advisor,StudentNumber
123,Fred,100,math,Mr Watts,123
456,Sam,92,English,Ms Smith,456
789,Sally,84,computers,Mr Watts,789
101,Susan,93,math,Ms Smith,101
102,Arthur,74,biology,Mr Watts,102
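In the output above, StudentNumber appears twice: once as the index and once as a regular column. That happens because the cell passes the column itself (`df['StudentNumber']`) to `set_index`. The sketch below, which uses a small hypothetical `students` dataframe built from the first two rows, compares that call with passing the column name as a string.

```python
import pandas as pd

# Hypothetical two-row dataframe built from the first rows of the table above.
students = pd.DataFrame({
    "StudentName": ["Fred", "Sam"],
    "Grade": [100, 92],
    "StudentNumber": [123, 456],
})

# Passing the column name moves StudentNumber out of the columns and into the index.
by_name = students.set_index("StudentNumber")

# Passing the Series (as in the cell above) keeps StudentNumber as a column as well.
by_series = students.set_index(students["StudentNumber"])

print(list(by_name.columns))
print(list(by_series.columns))
```

Both calls here return a new dataframe instead of using `inplace=True`, so `students` itself is left unchanged.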


In [22]:
# Now that we have made StudentNumber the index, we can look up a single row using the index value, e.g., student number 123.

df.loc[123]

StudentName          Fred
Grade                 100
Major                math
Advisor          Mr Watts
StudentNumber         123
Name: 123, dtype: object
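To make the lookup above more concrete, here is a small sketch contrasting `.loc[]`, which looks up by index label, with `.iloc[]`, which looks up by row position. The `students` dataframe is a hypothetical cut-down copy of the data above.

```python
import pandas as pd

# Hypothetical three-row dataframe indexed by student number.
students = pd.DataFrame(
    {"StudentName": ["Fred", "Sam", "Sally"], "Grade": [100, 92, 84]},
    index=pd.Index([123, 456, 789], name="StudentNumber"),
)

by_label = students.loc[123]             # the row whose index label is 123
by_position = students.iloc[0]           # the first row, whatever its label is
fred_grade = students.loc[123, "Grade"]  # a single cell: row label, then column name

print(by_label)
print(fred_grade)
```

Here `students.loc[0]` would raise a `KeyError`, because no student has the number 0.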

In [23]:
# info() shows each column's data type and how many non-null values it has.

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 123 to 102
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   StudentName    5 non-null      object
 1   Grade          5 non-null      int64 
 2   Major          5 non-null      object
 3   Advisor        5 non-null      object
 4   StudentNumber  5 non-null      int64 
dtypes: int64(2), object(3)
memory usage: 412.0+ bytes


In [24]:
# You can make a Pandas dataframe in many ways.  Here we use the constructor to create one from a dictionary.
# Each key becomes a column name and each value is a list holding that column's data.

b = pd.DataFrame({
     "cars": ['mercedes', 'bmw'],
     "colors" : ['white', 'black']
})

b

,cars,colors
0,mercedes,white
1,bmw,black
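As one more of the "many ways" mentioned above, the sketch below builds the same two-row dataframe from a list of dictionaries, one dictionary per row. The names `rows` and `b_from_rows` are made up for this example.

```python
import pandas as pd

# One dictionary per row; the keys become the column names.
rows = [
    {"cars": "mercedes", "colors": "white"},
    {"cars": "bmw", "colors": "black"},
]

b_from_rows = pd.DataFrame(rows)
print(b_from_rows)
```

The dictionary-of-lists form is organized by column, while this form is organized by row, which can be convenient when records arrive one at a time.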


In [25]:
# This prints the last two rows of the dataframe.

df.tail(2)

StudentNumber,StudentName,Grade,Major,Advisor,StudentNumber
101,Susan,93,math,Ms Smith,101
102,Arthur,74,biology,Mr Watts,102


In [26]:
# A single column from a dataframe is called a Series.  A Series has methods for statistical functions.
# Here we use the describe() method to show the count, mean, std, quartiles, etc.

df['Grade'].describe()

count      5.000000
mean      88.600000
std        9.939819
min       74.000000
25%       84.000000
50%       92.000000
75%       93.000000
max      100.000000
Name: Grade, dtype: float64
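The individual statistics reported by `describe()` are also available as separate Series methods. The sketch below recomputes a few of them on a standalone `grades` Series, a hypothetical name, holding the five grades from the table above.

```python
import pandas as pd

# The five grades from the dataframe above, as a standalone Series.
grades = pd.Series([100, 92, 84, 93, 74], name="Grade")

mean_grade = grades.mean()
std_grade = grades.std()        # sample standard deviation (divides by n - 1)
median_grade = grades.median()  # the same value as the 50% row in describe()

print(mean_grade, std_grade, median_grade)
```

Note that `std()` in pandas defaults to the sample standard deviation, which is why it matches the `std` row of `describe()`.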

In [27]:
b

,cars,colors
0,mercedes,white
1,bmw,black


In [28]:
# concat stacks two dataframes, one on top of the other.  Each row keeps its original index label.

c = pd.DataFrame({
     "cars": ['tesla', 'toyota'],
     "colors" : ['white', 'black']
})

d=pd.concat([b,c])
d



,cars,colors
0,mercedes,white
1,bmw,black
0,tesla,white
1,toyota,black
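The concatenated output above repeats the index labels 0 and 1, because each dataframe brought its own index along. If unique row labels are wanted, one option is the `ignore_index` parameter of `pd.concat`, sketched below; `d_renumbered` is a hypothetical name.

```python
import pandas as pd

b = pd.DataFrame({"cars": ["mercedes", "bmw"], "colors": ["white", "black"]})
c = pd.DataFrame({"cars": ["tesla", "toyota"], "colors": ["white", "black"]})

# ignore_index=True discards the original labels and renumbers the rows from 0.
d_renumbered = pd.concat([b, c], ignore_index=True)

print(d_renumbered)
```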


In [29]:
df

StudentNumber,StudentName,Grade,Major,Advisor,StudentNumber
123,Fred,100,math,Mr Watts,123
456,Sam,92,English,Ms Smith,456
789,Sally,84,computers,Mr Watts,789
101,Susan,93,math,Ms Smith,101
102,Arthur,74,biology,Mr Watts,102


In [30]:
# .loc[] is one way to look up rows.  Here we use it to keep only the rows whose Advisor is Mr Watts, sort them
# by Grade, and then use .iloc[0:3], which means retrieve only the first three rows.
# There are many ways to look up Pandas data.  We explain more about that here https://github.com/werowe/HypatiaAcademy/blob/master/pandas/pandas_search.ipynb


df.loc[df['Advisor'] == 'Mr Watts'].sort_values(by='Grade').iloc[0:3]



StudentNumber,StudentName,Grade,Major,Advisor,StudentNumber
102,Arthur,74,biology,Mr Watts,102
789,Sally,84,computers,Mr Watts,789
123,Fred,100,math,Mr Watts,123
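The one-line lookup above chains three steps together. The sketch below splits it into separate statements so that each intermediate result can be inspected. The `students` dataframe is a hypothetical copy of three columns of the data above, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical copy of three columns from the dataframe above.
students = pd.DataFrame({
    "StudentName": ["Fred", "Sam", "Sally", "Susan", "Arthur"],
    "Grade": [100, 92, 84, 93, 74],
    "Advisor": ["Mr Watts", "Ms Smith", "Mr Watts", "Ms Smith", "Mr Watts"],
})

is_watts = students["Advisor"] == "Mr Watts"   # boolean Series, one True/False per row
watts_only = students.loc[is_watts]            # keep only the rows where the mask is True
by_grade = watts_only.sort_values(by="Grade")  # sort ascending (lowest grade first)
first_three = by_grade.iloc[0:3]               # take the first three rows by position

print(first_three)
```

To list the highest grades first, `sort_values` also accepts `ascending=False`.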
