### Introduction to the pandas DataFrame object

A pandas DataFrame is a two-dimensional, tabular data structure, much like an Excel sheet with rows and columns. However, dataframes offer far more functionality than Excel, as we will see today.

In [1]:
# general imports. 
# we will be using pandas and numpy for this notebook
import pandas as pd
import numpy as np

In [2]:
# this is one of many ways of initializing a pandas dataframe. For now, we are not concerned with the others.
# focus on the key methods a dataframe provides

someData = pd.DataFrame(
                  np.array(
                        [[57, 55, 44, 81],
                        [82, 98, 71, 96],
                        [41, 41, 49, 81],
                        [97, 75, 61, 45],
                        [51, 90, 58, 67],
                        [40, 54, 47, 33]
                        ]),
                   index=[f'Student {i+1}' for i in range(0, 12, 2)],
                   columns=['Physics', 'Chemistry', 'Maths', 'Biology'])
print(someData)

            Physics  Chemistry  Maths  Biology
Student 1        57         55     44       81
Student 3        82         98     71       96
Student 5        41         41     49       81
Student 7        97         75     61       45
Student 9        51         90     58       67
Student 11       40         54     47       33


#### Things to note:
- The column names are strings. They can be integers as well.
- The row names (also called the index of the dataframe) are strings. In most cases, row indices will be integers running from $0$ to $n-1$, where $n$ is the total number of rows.
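The two notes above can be checked with a small sketch. The array `scores` and the name `defaultLabels` are hypothetical and used only for illustration; the point is that when no `index` or `columns` argument is passed, pandas falls back to integer labels starting at 0 for both rows and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 array of scores, used only for illustration
scores = np.array([[57, 55, 44],
                   [82, 98, 71]])

# No index or columns passed: pandas assigns integer labels 0..n-1 to both
defaultLabels = pd.DataFrame(scores)

print(defaultLabels)
print(list(defaultLabels.index))    # row labels
print(list(defaultLabels.columns))  # column labels
```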

In [3]:
# printing the total number of rows and columns in the dataset
someDataShape = someData.shape
print(f"someData has {someDataShape[0]} rows and {someDataShape[1]} columns") 
# notice that the 0th index has the row count and 1st index has column counts

someData has 6 rows and 4 columns


In [4]:
# display general info about the dataset, such as column names and datatypes
print(someData.info())

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Student 1 to Student 11
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Physics    6 non-null      int32
 1   Chemistry  6 non-null      int32
 2   Maths      6 non-null      int32
 3   Biology    6 non-null      int32
dtypes: int32(4)
memory usage: 144.0+ bytes
None


In [5]:
# to display only the column names do
print(someData.columns)

Index(['Physics', 'Chemistry', 'Maths', 'Biology'], dtype='object')


### Slicing or selecting a specific column/row/cell in a dataframe

In [6]:
# to select a column, use square brackets [] with the column name.
# Let's select the Physics column

print(someData['Physics'])

Student 1     57
Student 3     82
Student 5     41
Student 7     97
Student 9     51
Student 11    40
Name: Physics, dtype: int32


In [7]:
# to select multiple columns say column Physics and Maths, pass a list of column names like

print(someData[['Physics' , 'Maths']])

            Physics  Maths
Student 1        57     44
Student 3        82     71
Student 5        41     49
Student 7        97     61
Student 9        51     58
Student 11       40     47


In [8]:
# rows can be selected in 2 ways
# the 1st is by integer position, using the .iloc[] indexer
# here rows are numbered from 0 to n-1

# so to select the 3rd row, we do
print(someData.iloc[2])

Physics      41
Chemistry    41
Maths        49
Biology      81
Name: Student 5, dtype: int32


In [9]:
# the 2nd way is by exact row label, using the .loc[] indexer
print(someData.loc['Student 5'])

# notice that both methods give the same result
# because the 3rd row from the top has the row label 'Student 5'

Physics      41
Chemistry    41
Maths        49
Biology      81
Name: Student 5, dtype: int32
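One difference between the two indexers is worth a quick sketch: a slice in `.iloc[]` excludes its end position, while a label slice in `.loc[]` includes its end label. The frame `rows` below is a hypothetical cut-down version of the student data, built only for this illustration.

```python
import pandas as pd

# Hypothetical 3-row frame with string row labels
rows = pd.DataFrame({'Physics': [57, 82, 41], 'Maths': [44, 71, 49]},
                    index=['Student 1', 'Student 3', 'Student 5'])

# Position-based slice: positions 0 and 1, the end position 2 is excluded
byPosition = rows.iloc[0:2]

# Label-based slice: both 'Student 1' and 'Student 3' are included
byLabel = rows.loc['Student 1':'Student 3']

print(byPosition)
print(byLabel)
```

Both slices return the same two rows here, even though the `.iloc[]` slice ends at 2 and the `.loc[]` slice ends at the second row's label.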


In [10]:
# to select a range of rows say top 3 rows, you can do

print('Top 3 rows\n')
print(someData.iloc[0:3])

# notice the 0-based indexing: although we write 0:3, Python excludes the end position
# and stops one before it. In this case, only rows 0, 1 and 2 are selected

# if you are selecting rows from the start, you can skip the starting position and the result will be same
print('\nTop 3 rows using shorter notation\n')
print(someData.iloc[:3])


# another very handy method to display top 'n' rows is the head() method
# this is quite useful to quickly view the first few rows of large datasets with thousands of rows
print('\nTop 3 rows using head() method\n')
print(someData.head(3))

Top 3 rows

           Physics  Chemistry  Maths  Biology
Student 1       57         55     44       81
Student 3       82         98     71       96
Student 5       41         41     49       81

Top 3 rows using shorter notation

           Physics  Chemistry  Maths  Biology
Student 1       57         55     44       81
Student 3       82         98     71       96
Student 5       41         41     49       81

Top 3 rows using head() method

           Physics  Chemistry  Maths  Biology
Student 1       57         55     44       81
Student 3       82         98     71       96
Student 5       41         41     49       81


In [11]:
# to select last n rows, use negative indexing
# for selecting last 2 rows do
print("Last 2 rows of someData\n")
print(someData.iloc[-2:])

# notice the position of the index: the starting position comes before the :
# the end position is left blank because we go up to the last row, so instead of [-2:6] you can write [-2:]

# similar to head() we have the tail() method as well
print("\nLast 2 rows of someData using tail()\n")
print(someData.tail(2))

Last 2 rows of someData

            Physics  Chemistry  Maths  Biology
Student 9        51         90     58       67
Student 11       40         54     47       33

Last 2 rows of someData using tail()

            Physics  Chemistry  Maths  Biology
Student 9        51         90     58       67
Student 11       40         54     47       33


In [12]:
# selecting specific cell in the dataframe

# for score of Student 1 in Physics, do 
print(f"Score of Student 1 in Physics: {someData.loc['Student 1']['Physics']}")

# Or using iloc, you can do
print(f"Score of Student 1 in Physics: {someData.iloc[0]['Physics']}") # 0 row index is for Student 1

Score of Student 1 in Physics: 57
Score of Student 1 in Physics: 57
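The cell lookups above chain two selections (first a row, then a column). As a hedged alternative, `.loc[]` and `.iloc[]` also accept the row and the column together in one pair of brackets, and `.at[]` does a label lookup for a single value. The frame `cells` is a hypothetical two-row sample used only for this sketch.

```python
import pandas as pd

# Hypothetical two-row sample of the student scores
cells = pd.DataFrame({'Physics': [57, 82], 'Maths': [44, 71]},
                     index=['Student 1', 'Student 3'])

# row label and column label in a single lookup
print(cells.loc['Student 1', 'Physics'])

# row position and column position in a single lookup
print(cells.iloc[0, 0])

# .at[] looks up exactly one value by label
print(cells.at['Student 1', 'Physics'])
```

A single lookup avoids building the intermediate row, and it is generally the safer form when assigning a new value to a cell.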


## Task 1
1. Run the next cell, which generates a new dataframe called 'someMoreData'. Display its column names and data types. Using your output, comment on the 'Grade' column. Do you notice something irregular? Explain your observation briefly (no more than 2-3 lines).
2. Display the first 4 students' scores in Chemistry and Biology, together with their Grades. Note that you should not display other columns, only the ones mentioned in this question.

In [13]:
someMoreData = someData.copy()
someMoreData['Grade'] = ['A', 'B+', 'B', 'C', 'B+', np.nan]
print(someMoreData)


            Physics  Chemistry  Maths  Biology Grade
Student 1        57         55     44       81     A
Student 3        82         98     71       96    B+
Student 5        41         41     49       81     B
Student 7        97         75     61       45     C
Student 9        51         90     58       67    B+
Student 11       40         54     47       33   NaN


In [14]:
NoOfColandRow = someMoreData.shape
print(someMoreData.info())
# your code here. For adding text explanations create a markdown cell below with question number.

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Student 1 to Student 11
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Physics    6 non-null      int32 
 1   Chemistry  6 non-null      int32 
 2   Maths      6 non-null      int32 
 3   Biology    6 non-null      int32 
 4   Grade      5 non-null      object
dtypes: int32(4), object(1)
memory usage: 364.0+ bytes
None


The dtype of the Grade column is object, not int32, because it holds strings rather than numbers. It also has only 5 non-null entries out of 6: five students have a grade and one entry (Student 11) is missing, which is shown as 'NaN'.
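The missing entry can also be located directly rather than read off the `info()` summary. The sketch below assumes a hypothetical three-row frame `grades`; it uses `isna()`, which marks each missing value with True.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing grade
grades = pd.DataFrame({'Grade': ['A', 'B+', np.nan]},
                      index=['Student 1', 'Student 3', 'Student 11'])

# count the missing values in each column
missingPerColumn = grades.isna().sum()
print(missingPerColumn)

# show only the rows where Grade is missing
missingRows = grades[grades['Grade'].isna()]
print(missingRows)
```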


In [15]:
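# Q2. 
# rows 0-3 are the first 4 students; columns 1, 3 and 4 are Chemistry, Biology and Grade
print(someMoreData.iloc[[0, 1, 2, 3], [1, 3, 4]])

           Chemistry  Biology Grade
Student 1         55       81     A
Student 3         98       96    B+
Student 5         41       81     B
Student 7         75       45     C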


### Filtering data using mathematical conditions

In data science, you will perform many analyses on a dataset. For instance, you may want to know the grades of all students who scored more than 90 in Physics. We will discuss such use cases below.

In [16]:
# lets say we want to filter someMoreData to display all the students who scored less than 50 in Biology

print(f"Students scoring less than 50 in Biology are:\n\n{someMoreData[someMoreData['Biology'] < 50]}")

Students scoring less than 50 in Biology are:

            Physics  Chemistry  Maths  Biology Grade
Student 7        97         75     61       45     C
Student 11       40         54     47       33   NaN


In [17]:
# what about students scoring less than 50 in both Biology and Maths

print(f"Students scoring less than 50 in Biology and Maths are:\n\n{someMoreData[(someMoreData['Biology'] < 50) & (someMoreData['Maths'] < 50)]}")

# notice the chaining of multiple AND conditions. If you want to impose more conditions the chaining can grow longer 

Students scoring less than 50 in Biology and Maths are:

            Physics  Chemistry  Maths  Biology Grade
Student 11       40         54     47       33   NaN


In [18]:
# the filter above passed the boolean condition directly to [], which keeps the rows where it is True
# below we pass the same condition to .loc[] and get the same result

print(f"Students scoring less than 50 in Biology and Maths are:\n\n{someMoreData.loc[(someMoreData['Biology'] < 50) & (someMoreData['Maths'] < 50)]}")

# notice the syntax is almost the same: with a boolean condition, both [] and .loc[] select rows
# which form to use depends on the situation, which we will see in upcoming modules.

Students scoring less than 50 in Biology and Maths are:

            Physics  Chemistry  Maths  Biology Grade
Student 11       40         54     47       33   NaN
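A condition such as `someMoreData['Biology'] < 50` is itself a boolean Series with one True/False value per row, and that Series is what does the filtering. The sketch below, on a hypothetical frame `marks`, stores two such masks in variables and combines them with `|` (OR) and `~` (NOT), the counterparts of the `&` (AND) used above.

```python
import pandas as pd

# Hypothetical three-row sample of the student scores
marks = pd.DataFrame({'Maths': [44, 71, 61], 'Biology': [81, 96, 45]},
                     index=['Student 1', 'Student 3', 'Student 7'])

# each comparison yields a boolean Series, one value per row
lowMaths = marks['Maths'] < 50
lowBiology = marks['Biology'] < 50

# OR: rows where at least one condition holds
eitherLow = marks[lowMaths | lowBiology]
print(eitherLow)

# NOT: rows where the Maths condition does not hold
notLowMaths = marks[~lowMaths]
print(notLowMaths)
```

When conditions are written inline instead of stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than comparisons such as `<`.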


## Task 2
- Display all the records in 'someMoreData' where a student has scored at least 90 in Chemistry. You may use any filtration method discussed above.
- Display all the records in 'someMoreData' where a student got a C grade or scored less than 50 in Maths.

In [19]:
# Q1. 
print(f"Students scoring 90 or more in Chemistry are:\n\n{someMoreData[someMoreData['Chemistry'] >= 90]}")

Students scoring 90 or more in Chemistry are:

           Physics  Chemistry  Maths  Biology Grade
Student 3       82         98     71       96    B+
Student 9       51         90     58       67    B+


In [20]:
# Q2. 
# rows where Maths is below 50 OR the Grade is 'C'; each condition sits in its own parentheses
someMoreData[(someMoreData['Maths'] < 50) | (someMoreData['Grade'] == 'C')]

            Physics  Chemistry  Maths  Biology Grade
Student 1        57         55     44       81     A
Student 5        41         41     49       81     B
Student 7        97         75     61       45     C
Student 11       40         54     47       33   NaN


## Mathematical operations and adding new columns to dataframes

In [21]:
# dataframes support arithmetic operations just like we would do normally. 
# for instance I can create a new 'Total' column 
# that stores the sum of all scores for a student 

someMoreData['Total'] = someMoreData['Physics'] + someMoreData['Chemistry'] + someMoreData['Maths'] + someMoreData['Biology']
print("someMoreData after adding the Total column\n")
print(someMoreData)

# notice the 'Total' column was added at the very end.
# You can also insert a new column at a specific position using DataFrame.insert().

someMoreData after adding the Total column

            Physics  Chemistry  Maths  Biology Grade  Total
Student 1        57         55     44       81     A    237
Student 3        82         98     71       96    B+    347
Student 5        41         41     49       81     B    212
Student 7        97         75     61       45     C    278
Student 9        51         90     58       67    B+    266
Student 11       40         54     47       33   NaN    174
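Two variations on the cell above may be useful, shown here as a sketch on a hypothetical two-row frame `totals`. First, `sum(axis=1)` adds across the chosen columns for each row, which avoids writing out every `+`. Second, `DataFrame.insert(loc, column, value)` places the new column at a given position instead of at the end; note that it modifies the dataframe in place.

```python
import pandas as pd

# Hypothetical two-row sample with two subjects and a grade
totals = pd.DataFrame({'Physics': [57, 82],
                       'Chemistry': [55, 98],
                       'Grade': ['A', 'B+']},
                      index=['Student 1', 'Student 3'])

# sum only the numeric subject columns, row by row
subjects = ['Physics', 'Chemistry']
rowSums = totals[subjects].sum(axis=1)

# insert 'Total' as the first column (position 0); this changes totals in place
totals.insert(0, 'Total', rowSums)
print(totals)
```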


## Task 3
- Add a new column 'AverageScore' that stores the average score of each student.
- Now display the students whose 'AverageScore' is at least 66.5.

In [22]:
# Q1. 
# 'Total' already holds the sum of the 4 subject scores, so dividing it by 4 gives the average
someMoreData['AverageScore'] = someMoreData['Total'].div(4)
print(someMoreData)

            Physics  Chemistry  Maths  Biology Grade  Total  AverageScore
Student 1        57         55     44       81     A    237         59.25
Student 3        82         98     71       96    B+    347         86.75
Student 5        41         41     49       81     B    212         53.00
Student 7        97         75     61       45     C    278         69.50
Student 9        51         90     58       67    B+    266         66.50
Student 11       40         54     47       33   NaN    174         43.50
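The average can also be computed with `mean(axis=1)`, provided the calculation is limited to the subject columns. Taking the mean across the whole dataframe would not give the intended result: 'Total' would be counted as a fifth score, and the non-numeric 'Grade' column is either skipped or raises an error depending on the pandas version. The frame `averages` below is a hypothetical two-row sample used only for this sketch.

```python
import pandas as pd

# Hypothetical two-row sample including the Grade and Total columns
averages = pd.DataFrame({'Physics': [57, 82], 'Chemistry': [55, 98],
                         'Maths': [44, 71], 'Biology': [81, 96],
                         'Grade': ['A', 'B+'], 'Total': [237, 347]},
                        index=['Student 1', 'Student 3'])

# restrict the mean to the four subject columns
subjects = ['Physics', 'Chemistry', 'Maths', 'Biology']
averages['AverageScore'] = averages[subjects].mean(axis=1)
print(averages)
```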


In [23]:
# Q2. 
someMoreData[someMoreData['AverageScore'] >= 66.5]

           Physics  Chemistry  Maths  Biology Grade  Total  AverageScore
Student 3       82         98     71       96    B+    347         86.75
Student 7       97         75     61       45     C    278         69.50
Student 9       51         90     58       67    B+    266         66.50


### notebook complete