# Summer of Code - Artificial Intelligence
## Week 04: Exploratory Data Analysis
### Day 03: Introduction to Pandas

In this notebook, we will explore the basics of the **Pandas** library.

# Introduction to Pandas

Pandas is a fast, powerful, and flexible Python library for working with tabular data.
- **Typical workflow**: load -> inspect -> clean -> transform -> analyze -> visualize.
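
The workflow above can be sketched end to end on a tiny dataset. This is a minimal illustration, not a recipe: the in-memory CSV, the `score` and `passed` columns, and the pass mark of 75 are all made up for the example.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk
raw = io.StringIO(
    "name,major,score\n"
    "Ada,CS,90\n"
    "Grace,Math,\n"
    "Alan,CS,70\n"
)

df = pd.read_csv(raw)                                  # load
print(df.dtypes)                                       # inspect column types
df["score"] = df["score"].fillna(df["score"].mean())   # clean: fill the gap with the mean
df["passed"] = df["score"] >= 75                       # transform: derive a new column
pass_rate = df["passed"].mean()                        # analyze: share of passing students
print(pass_rate)
# visualize: with matplotlib installed, df["score"].plot(kind="bar") would draw a bar chart
```

Each step maps to one stage of the workflow; real projects usually loop back between inspecting and cleaning several times.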






In [1]:
!pip install pandas



In [31]:
import pandas as pd
import numpy as np

pd.__version__

'2.2.3'

# Pandas Data Structures and Data Input

### Series: 1D labeled array
- Can hold any data type (int, float, str, Python objects)
- Has an index (labels) and values

### DataFrame: 2D labeled table
- Columns are `Series` aligned on the same index
- Index labels for rows, column labels for columns
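
To see what "aligned on the same index" means, here is a small sketch that builds a DataFrame from two Series whose labels only partly overlap. The names and values are hypothetical.

```python
import pandas as pd

# Two Series with overlapping but different labels (made-up data)
age = pd.Series([20, 22], index=["ada", "grace"])
gpa = pd.Series([3.9, 3.2], index=["grace", "alan"])

# Each Series becomes a column; rows are matched by label, not by position
df = pd.DataFrame({"age": age, "gpa": gpa})
print(df)

# Labels missing from one Series are filled with NaN in that column
print(df.loc["grace"])
```

Only `grace` appears in both Series, so only that row has both values; the other rows get `NaN` where a label was missing.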


In [4]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [5]:
pd.Series(["abc", "def", "ghi"])

0    abc
1    def
2    ghi
dtype: object

In [6]:
s = pd.Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [7]:
type(s)

pandas.core.series.Series

In [8]:
s.index

RangeIndex(start=0, stop=3, step=1)

In [13]:
s = pd.Series([100, 200, 300], index=[10, 20, 30])
s

10    100
20    200
30    300
dtype: int64

In [14]:
s[10]

np.int64(100)

In [16]:
s.index

Index([10, 20, 30], dtype='int64')

In [19]:
s.index = ["A", "B", "C"]
s

A    100
B    200
C    300
dtype: int64

In [20]:
s["A"]

np.int64(100)

In [26]:
s.values

array([100, 200, 300])

In [28]:
s = pd.Series([10, 20, "A"], index=["A", "B", "C"])
s

A    10
B    20
C     A
dtype: object

In [29]:
a_list = list(range(10))
a_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [30]:
pd.Series(a_list)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [32]:
arr = np.linspace(2, 10, 10)
arr

array([ 2.        ,  2.88888889,  3.77777778,  4.66666667,  5.55555556,
        6.44444444,  7.33333333,  8.22222222,  9.11111111, 10.        ])

In [33]:
arr = arr.reshape((5, 2))
arr

array([[ 2.        ,  2.88888889],
       [ 3.77777778,  4.66666667],
       [ 5.55555556,  6.44444444],
       [ 7.33333333,  8.22222222],
       [ 9.11111111, 10.        ]])

In [35]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1
0,2.0,2.888889
1,3.777778,4.666667
2,5.555556,6.444444
3,7.333333,8.222222
4,9.111111,10.0


In [41]:
students = {
    "name": ["Ahmad", "Ali", "Waqas"],
    "age": [23, 30, 15],
    "gpa": [3.4, 2.5, 3.7],
    "roll_no": [10, 11, 12]
}
type(students)

dict

In [43]:
students_df = pd.DataFrame(students)
students_df

Unnamed: 0,name,age,gpa,roll_no
0,Ahmad,23,3.4,10
1,Ali,30,2.5,11
2,Waqas,15,3.7,12


In [44]:
students_df.index

RangeIndex(start=0, stop=3, step=1)

In [45]:
students_df.columns

Index(['name', 'age', 'gpa', 'roll_no'], dtype='object')

In [46]:
type(students_df)

pandas.core.frame.DataFrame

In [51]:
students_df = pd.read_csv('students.csv')
students_df

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9
2,Alan,21,CS,3.2
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7


# Selection and Indexing

## Row selection
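Select rows by integer position with `iloc` and by index label with `loc`. Slices with `iloc` exclude the stop position, while label slices with `loc` include the stop label.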
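
The two indexers differ in how slices treat their end point, which is easy to miss. The sketch below uses a small hypothetical frame so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Grace", "Alan"], "age": [20, 22, 21]},
    index=["A", "B", "C"],
)

by_position = df.iloc[0:2]   # positions 0 and 1; the stop position is excluded
by_label = df.loc["A":"B"]   # labels A and B; the stop label is included
wider = df.loc["A":"C"]      # three rows, even though C sits at position 2

# With an integer index, labels and positions can disagree
s = pd.Series([100, 200, 300], index=[10, 20, 30])
first_by_label = s.loc[10]     # the label 10
first_by_position = s.iloc[0]  # the position 0
```

Spelling out `loc` or `iloc` avoids ambiguity when the index labels are themselves integers.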


In [64]:
students_df.iloc[0]

name     Ada
age       20
major     CS
gpa      3.8
Name: 0, dtype: object

In [65]:
students_df.iloc[0, 2]

'CS'

In [66]:
students_df.iloc[:2]

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9


In [68]:
students_df.iloc[:2, :2]

Unnamed: 0,name,age
0,Ada,20
1,Grace,22


In [74]:
students_df.index = ["A", "B", "C", "D", "E"]

In [75]:
students_df

Unnamed: 0,name,age,major,gpa
A,Ada,20,CS,3.8
B,Grace,22,Math,3.9
C,Alan,21,CS,3.2
D,Linus,23,CS,3.5
E,Guido,24,Physics,3.7


In [78]:
students_df.loc["B"]

name     Grace
age         22
major     Math
gpa        3.9
Name: B, dtype: object

In [82]:
students_df.loc["A":"D"]

Unnamed: 0,name,age,major,gpa
A,Ada,20,CS,3.8
B,Grace,22,Math,3.9
C,Alan,21,CS,3.2
D,Linus,23,CS,3.5


In [85]:
students_df.loc["A":"D", ["name", "age"]]

Unnamed: 0,name,age
A,Ada,20
B,Grace,22
C,Alan,21
D,Linus,23


In [88]:
students_df.set_index("name", inplace=True)
students_df

name,age,major,gpa
Ada,20,CS,3.8
Grace,22,Math,3.9
Alan,21,CS,3.2
Linus,23,CS,3.5
Guido,24,Physics,3.7


In [89]:
students_df.index

Index(['Ada', 'Grace', 'Alan', 'Linus', 'Guido'], dtype='object', name='name')

In [90]:
students_df.loc["Ada"]

age       20
major     CS
gpa      3.8
Name: Ada, dtype: object

In [92]:
students_df.reset_index(inplace=True)
students_df

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9
2,Alan,21,CS,3.2
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7


## Column selection
Select by column name: `df["col"]` returns a `Series`, while `df[["col1", "col2"]]` returns a `DataFrame`.


In [53]:
age = students_df['age']
age

0    20
1    22
2    21
3    23
4    24
Name: age, dtype: int64

In [54]:
type(age)

pandas.core.series.Series

In [55]:
data = students_df[["age", "gpa"]]
data

Unnamed: 0,age,gpa
0,20,3.8
1,22,3.9
2,21,3.2
3,23,3.5
4,24,3.7


In [93]:
students_df

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9
2,Alan,21,CS,3.2
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7


## Conditional selection
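Filter rows with a boolean mask: a comparison such as `students_df['age'] > 21` yields a boolean `Series`, and indexing with it keeps only the rows marked `True`.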


In [98]:
students_df['age'] > 21

0    False
1     True
2    False
3     True
4     True
Name: age, dtype: bool

In [99]:
students_df[students_df['age'] > 21]

Unnamed: 0,name,age,major,gpa
1,Grace,22,Math,3.9
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7
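
Masks can also be combined. The sketch below rebuilds the student table inline so it runs on its own; note that pandas uses `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`, and each comparison needs its own parentheses.

```python
import pandas as pd

students_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Linus", "Guido"],
    "age": [20, 22, 21, 23, 24],
    "major": ["CS", "Math", "CS", "CS", "Physics"],
    "gpa": [3.8, 3.9, 3.2, 3.5, 3.7],
})

# AND: both conditions must hold
older_cs = students_df[(students_df["age"] > 21) & (students_df["major"] == "CS")]

# Membership test: a compact alternative to chaining several | comparisons
math_or_physics = students_df[students_df["major"].isin(["Math", "Physics"])]

# NOT: invert a mask with ~
not_cs = students_df[~(students_df["major"] == "CS")]
```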


## Subset selection
Select rows and columns together with `loc` (labels or boolean masks) or `iloc` (integer positions).
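
A minimal sketch of subset selection, using an inline copy of the student table so the block is self-contained. The variable names `cs_gpas` and `corner` are only illustrative.

```python
import pandas as pd

students_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Linus", "Guido"],
    "age": [20, 22, 21, 23, 24],
    "major": ["CS", "Math", "CS", "CS", "Physics"],
    "gpa": [3.8, 3.9, 3.2, 3.5, 3.7],
})

# loc: a row mask plus a list of column labels
cs_gpas = students_df.loc[students_df["major"] == "CS", ["name", "gpa"]]

# iloc: lists of row positions and column positions
corner = students_df.iloc[[0, 2], [0, 3]]

print(cs_gpas)
print(corner)
```

Passing the row filter and the column list in a single `loc` call selects both at once instead of chaining two indexing steps.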


# Operations on DataFrames

## Quick look
`head`, `tail`, `info`, `describe`


In [101]:
students_df.head(n=3)

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9
2,Alan,21,CS,3.2


In [102]:
students_df.tail(n=2)

Unnamed: 0,name,age,major,gpa
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7


In [103]:
students_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    5 non-null      object 
 1   age     5 non-null      int64  
 2   major   5 non-null      object 
 3   gpa     5 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 292.0+ bytes


In [105]:
students_df.describe()

Unnamed: 0,age,gpa
count,5.0,5.0
mean,22.0,3.62
std,1.581139,0.277489
min,20.0,3.2
25%,21.0,3.5
50%,22.0,3.7
75%,23.0,3.8
max,24.0,3.9


## Unique values and counts
`unique`, `nunique`, `value_counts`


In [106]:
students_df

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9
2,Alan,21,CS,3.2
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7


In [107]:
students_df['major'].unique()

array(['CS', 'Math', 'Physics'], dtype=object)

In [108]:
students_df['major'].nunique()

3

In [109]:
students_df['major'].value_counts()

major
CS         3
Math       1
Physics    1
Name: count, dtype: int64

In [111]:
students_df['major'].value_counts(normalize=True)

major
CS         0.6
Math       0.2
Physics    0.2
Name: proportion, dtype: float64

In [112]:
students_df["major"].value_counts(ascending=True)

major
Math       1
Physics    1
CS         3
Name: count, dtype: int64

In [114]:
students_df

Unnamed: 0,name,age,major,gpa
0,Ada,20,CS,3.8
1,Grace,22,Math,3.9
2,Alan,21,CS,3.2
3,Linus,23,CS,3.5
4,Guido,24,Physics,3.7


In [116]:
grades = ["A", "A+", "C", "B", "B"]

students_df["grades"] = grades

In [117]:
students_df

Unnamed: 0,name,age,major,gpa,grades
0,Ada,20,CS,3.8,A
1,Grace,22,Math,3.9,A+
2,Alan,21,CS,3.2,C
3,Linus,23,CS,3.5,B
4,Guido,24,Physics,3.7,B


## Sorting and ordering


In [121]:
students_df.sort_values(by="age", ascending=False, inplace=True)

In [122]:
students_df

Unnamed: 0,name,age,major,gpa,grades
4,Guido,24,Physics,3.7,B
3,Linus,23,CS,3.5,B
1,Grace,22,Math,3.9,A+
2,Alan,21,CS,3.2,C
0,Ada,20,CS,3.8,A


In [123]:
students_df.reset_index()

Unnamed: 0,index,name,age,major,gpa,grades
0,4,Guido,24,Physics,3.7,B
1,3,Linus,23,CS,3.5,B
2,1,Grace,22,Math,3.9,A+
3,2,Alan,21,CS,3.2,C
4,0,Ada,20,CS,3.8,A
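
Two common variations, sketched on an inline copy of the table: sorting by several columns with a direction per column, and passing `drop=True` to `reset_index` so the old labels are discarded instead of being kept as an `index` column as in the output above.

```python
import pandas as pd

students_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Linus", "Guido"],
    "age": [20, 22, 21, 23, 24],
    "major": ["CS", "Math", "CS", "CS", "Physics"],
    "gpa": [3.8, 3.9, 3.2, 3.5, 3.7],
})

# Sort by major (A to Z), then by gpa (highest first) within each major
by_major_gpa = students_df.sort_values(by=["major", "gpa"], ascending=[True, False])

# Renumber the rows 0..n-1 and discard the old index labels
renumbered = by_major_gpa.reset_index(drop=True)
print(renumbered)
```

Without `inplace=True`, both methods return a new DataFrame and leave `students_df` unchanged, so the result has to be assigned to a name to be kept.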


# Handling Missing Values

- Checking for missing: `isna`, `notna`
- Counting missing per column
- Deleting rows/columns with missing: `dropna` (`subset`, `thresh`)
- Imputing/filling: `fillna` (scalar, per-column dict, forward/back fill)
- Simple statistical imputation (mean/median)
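
The `subset` and `thresh` options of `dropna` and the per-column dict form of `fillna` can be sketched as follows. The frame mirrors the student data with gaps; the placeholder `"Unknown"` is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Linus", "Guido"],
    "age": [20.0, np.nan, 21.0, 23.0, np.nan],
    "major": ["CS", "Math", "CS", "CS", np.nan],
    "gpa": [3.8, 3.9, np.nan, 3.5, 3.7],
})

# subset: only look at the listed columns when deciding which rows to drop
has_age = data.dropna(subset=["age"])

# thresh: keep rows that have at least this many non-missing values
mostly_full = data.dropna(thresh=3)

# dict: a different fill value per column; columns not listed are left alone
filled = data.fillna({"major": "Unknown", "gpa": data["gpa"].mean()})
```

All three calls return new DataFrames, so `data` itself keeps its missing values.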


In [124]:
students_df = pd.read_csv('students.csv')  # this version of the file contains missing values
students_df

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,,Math,3.9
2,Alan,21.0,CS,
3,Linus,23.0,CS,3.5
4,Guido,,,3.7


In [125]:
students_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    5 non-null      object 
 1   age     3 non-null      float64
 2   major   4 non-null      object 
 3   gpa     4 non-null      float64
dtypes: float64(2), object(2)
memory usage: 292.0+ bytes
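
Notice that `age` is now `float64` rather than `int64`. By default pandas marks missing numbers with `NaN`, which is a floating-point value, so an integer column with gaps is read as float. The sketch below reproduces this with two hypothetical in-memory CSVs; the nullable `Int64` dtype in the last line is an optional extra, not something the rest of this notebook relies on.

```python
import io
import pandas as pd

complete = pd.read_csv(io.StringIO("name,age\nAda,20\nGrace,22\n"))
with_gap = pd.read_csv(io.StringIO("name,age\nAda,20\nGrace,\n"))

print(complete["age"].dtype)  # integer column, no gaps
print(with_gap["age"].dtype)  # same column with one empty field

# Optional: pandas' nullable integer dtype keeps whole numbers alongside missing values
nullable = with_gap["age"].astype("Int64")
print(nullable)
```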


In [126]:
students_df.describe()

Unnamed: 0,age,gpa
count,3.0,4.0
mean,21.333333,3.725
std,1.527525,0.170783
min,20.0,3.5
25%,20.5,3.65
50%,21.0,3.75
75%,22.0,3.825
max,23.0,3.9


In [135]:
students_df.isnull()

Unnamed: 0,name,age,major,gpa
0,False,False,False,False
1,False,True,False,False
2,False,False,False,True
3,False,False,False,False
4,False,True,True,False


In [136]:
students_df.isna()

Unnamed: 0,name,age,major,gpa
0,False,False,False,False
1,False,True,False,False
2,False,False,False,True
3,False,False,False,False
4,False,True,True,False


In [138]:
students_df.notna()

Unnamed: 0,name,age,major,gpa
0,True,True,True,True
1,True,False,True,True
2,True,True,True,False
3,True,True,True,True
4,True,False,False,True


In [139]:
students_df.isna().sum()

name     0
age      2
major    1
gpa      1
dtype: int64

In [144]:
data = students_df.copy()
data

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,,Math,3.9
2,Alan,21.0,CS,
3,Linus,23.0,CS,3.5
4,Guido,,,3.7


In [None]:
data.dropna()  # drop every row that has at least one missing value

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
3,Linus,23.0,CS,3.5


In [None]:
data.dropna(axis=1)  # drop every column that has at least one missing value

Unnamed: 0,name
0,Ada
1,Grace
2,Alan
3,Linus
4,Guido


In [145]:
data.fillna(0)

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,0.0,Math,3.9
2,Alan,21.0,CS,0.0
3,Linus,23.0,CS,3.5
4,Guido,0.0,0,3.7


In [148]:
# fillna(method='ffill') is deprecated since pandas 2.1; ffill() is the supported spelling
data.ffill()


Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,20.0,Math,3.9
2,Alan,21.0,CS,3.9
3,Linus,23.0,CS,3.5
4,Guido,23.0,CS,3.7


In [149]:
# Deprecated spelling (emits a FutureWarning in pandas 2.1+); prefer data.bfill()
data.fillna(method='bfill')


Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,21.0,Math,3.9
2,Alan,21.0,CS,3.5
3,Linus,23.0,CS,3.5
4,Guido,,,3.7


In [150]:
data.bfill()

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,21.0,Math,3.9
2,Alan,21.0,CS,3.5
3,Linus,23.0,CS,3.5
4,Guido,,,3.7


In [151]:
data

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,,Math,3.9
2,Alan,21.0,CS,
3,Linus,23.0,CS,3.5
4,Guido,,,3.7


In [152]:
mean_age = data['age'].mean()
mean_age

np.float64(21.333333333333332)

In [153]:
data['age']

0    20.0
1     NaN
2    21.0
3    23.0
4     NaN
Name: age, dtype: float64

In [154]:
data['age'].fillna(mean_age)

0    20.000000
1    21.333333
2    21.000000
3    23.000000
4    21.333333
Name: age, dtype: float64

In [None]:
# data["age"] = data["age"].fillna(data["age"].mean())
data["age"] = data["age"].fillna(mean_age)
data

Unnamed: 0,name,age,major,gpa
0,Ada,20.0,CS,3.8
1,Grace,21.333333,Math,3.9
2,Alan,21.0,CS,
3,Linus,23.0,CS,3.5
4,Guido,21.333333,,3.7


### Practice Exercises

Try these small tasks to reinforce learning:

1. From `students_df`, select only `name` and `age` where `gpa > 3.6`.
2. `exam_df` is not defined in this notebook: create a DataFrame with `midterm`, `final`, and `project` score columns, then compute the average of the three as a new column.
3. Count how many students there are per `major` and sort the counts in descending order.
4. Introduce a missing value in `students_df['age']`, then impute it using the median age.
5. Save `students_df` to a CSV file and read it back, verifying the column types with `info()`.
