# Data Analysis with Pandas 🐼
**Introduction to Pandas**

**Pandas** is a Python library used for data analysis and manipulation. It is built on top of NumPy, so if you already know how NumPy arrays work, much of that knowledge carries over. Pandas makes it easier to handle labeled data such as tables (rows and columns).

**Pandas has historically offered three data structures:**

- **Series** – 1D labeled array (like a single column in Excel).

- **DataFrame** – 2D labeled table (like an Excel sheet with rows and columns).

- **Panel** – 3D labeled data. Panel was deprecated and then removed in pandas 0.25; MultiIndex DataFrames are used instead.

We'll focus on Series and DataFrame, because Panel is no longer available in current pandas releases.
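
As a rough sketch of how a MultiIndex DataFrame can stand in for 3D data, the example below stores marks per term and per student in one table. The term names and marks are made up for illustration; they are not part of the dataset used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical data: marks for two students across two terms.
index = pd.MultiIndex.from_product(
    [["Term1", "Term2"], ["Ali", "Sara"]],
    names=["term", "student"],
)
marks = pd.DataFrame(
    {"Math": [85, 90, 88, 93], "Physics": [80, 85, 82, 87]},
    index=index,
)

# Select every row that belongs to one value of the outer level.
term1 = marks.loc["Term1"]

# Select a single cell with a (term, student) tuple plus a column label.
sara_term2_math = marks.loc[("Term2", "Sara"), "Math"]

print(term1)
print(sara_term2_math)
```

The outer index level plays the role of the third dimension, so each term can be pulled out as an ordinary 2D DataFrame.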

# Core Concepts of **Pandas 🐼**

## Install and Import Pandas

In [83]:
#!pip install pandas
import pandas as pd

## **Pandas Series 📊**

A Series is a one-dimensional array with labels (index).

In [84]:
# Example: Student marks in Math
math_marks = pd.Series([85, 90, 78, 92, 88], index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'])
print(math_marks)


Ali      85
Sara     90
Zia      78
Aisha    92
Bilal    88
dtype: int64


**Concepts:**

`math_marks.values` → returns the raw data as a NumPy array: `[85, 90, 78, 92, 88]`

`math_marks.index` → returns the labels: `['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal']`
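
A small sketch of pulling the data and the labels out of the Series defined above. It uses `to_numpy()` and `tolist()`, which are standard pandas methods; the variable names `raw` and `labels` are chosen here only for illustration.

```python
import pandas as pd

math_marks = pd.Series([85, 90, 78, 92, 88],
                       index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'])

raw = math_marks.to_numpy()          # the data as a NumPy array
labels = math_marks.index.tolist()   # the labels as a plain Python list

print(type(raw).__name__, raw)
print(labels)
```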

**Access Elements by Label (Index)**

In [85]:
print(math_marks['Zia'])

78
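
Besides plain square brackets, a Series can be read explicitly by label with `.loc` or by integer position with `.iloc`. The sketch below also uses `Series.get`, which returns a default instead of raising an error when a label is missing; the name `'Omar'` is a made-up label that is not in the data.

```python
import pandas as pd

math_marks = pd.Series([85, 90, 78, 92, 88],
                       index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'])

zia_by_label = math_marks.loc['Zia']    # lookup by label
zia_by_position = math_marks.iloc[2]    # lookup by position (0-based)

# 'Omar' is a hypothetical label; get() returns None instead of raising KeyError.
missing = math_marks.get('Omar')

print(zia_by_label, zia_by_position, missing)
```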


**For example:** the teacher wants to give every student 2 extra marks.

In [86]:
print(math_marks)
print("-"*30)
# Add 2 marks to every student
math_marks += 2
print(math_marks)

Ali      85
Sara     90
Zia      78
Aisha    92
Bilal    88
dtype: int64
------------------------------
Ali      87
Sara     92
Zia      80
Aisha    94
Bilal    90
dtype: int64


Next, the teacher wants to know which students scored more than 90 marks. **Boolean filtering** selects exactly those entries.

In [87]:
high_score = math_marks[math_marks>90]
print(high_score)

Sara     92
Aisha    94
dtype: int64
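
Conditions can also be combined. The sketch below assumes the marks after the 2 extra marks were added, and uses `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because of operator precedence.

```python
import pandas as pd

# Marks after the 2 extra marks were added.
math_marks = pd.Series([87, 92, 80, 94, 90],
                       index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'])

# Marks from 85 to 90, both ends included.
in_band = math_marks[(math_marks >= 85) & (math_marks <= 90)]

# Marks below 85 or above 92.
extremes = math_marks[(math_marks < 85) | (math_marks > 92)]

# Everyone who did NOT score more than 90.
not_high = math_marks[~(math_marks > 90)]

print(in_band)
print(extremes)
print(not_high)
```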


## **Pandas DataFrame 🗂️**

A DataFrame is like a spreadsheet or SQL table. It has rows (index) and columns.

In [88]:
# Example : Student scores in multiple subjects.

data = {
    'Math': [85, 90, 78, 92, 88],
    'Physics': [80, 85, 75, 95, 89],
    'Chemistry': [82, 88, 79, 91, 87]
}

students = pd.DataFrame(data,index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'])
print(students)

       Math  Physics  Chemistry
Ali      85       80         82
Sara     90       85         88
Zia      78       75         79
Aisha    92       95         91
Bilal    88       89         87


**Accessing DataFrame Data**

In [89]:
# Accessing a single column
print(students["Math"])
print("")

# Accessing multiple columns
print(students[['Math','Chemistry']])

Ali      85
Sara     90
Zia      78
Aisha    92
Bilal    88
Name: Math, dtype: int64

       Math  Chemistry
Ali      85         82
Sara     90         88
Zia      78         79
Aisha    92         91
Bilal    88         87


In [90]:
# Access a row by index label
print(students.loc['Zia'])

print("-"*30)

# Access a row by integer position
print(students.iloc[1]) # Sara is at position 1

Math         78
Physics      75
Chemistry    79
Name: Zia, dtype: int64
------------------------------
Math         90
Physics      85
Chemistry    88
Name: Sara, dtype: int64
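
`.loc` and `.iloc` also accept slices, with one difference worth noting: a label slice includes both end points, while a position slice excludes the end, like normal Python slicing. A minimal sketch using the same student table:

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90, 78, 92, 88],
     'Physics': [80, 85, 75, 95, 89],
     'Chemistry': [82, 88, 79, 91, 87]},
    index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'],
)

by_label = students.loc['Sara':'Aisha']   # label slice: both ends included
by_position = students.iloc[1:3]          # position slice: end excluded

# A row label and a column label together select a single cell.
zia_physics = students.loc['Zia', 'Physics']

print(by_label)
print(by_position)
print(zia_physics)
```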


**Basic DataFrame Operations**

In [91]:
#Add a new column 
students['Biology']=[85, 87, 80, 90, 86]
print(students)

       Math  Physics  Chemistry  Biology
Ali      85       80         82       85
Sara     90       85         88       87
Zia      78       75         79       80
Aisha    92       95         91       90
Bilal    88       89         87       86


In [92]:
# Calculate each student's mean mark across all subjects
students['Average'] = students.mean(axis=1)  # axis=1: one mean per row, taken across the columns
print(students)

       Math  Physics  Chemistry  Biology  Average
Ali      85       80         82       85     83.0
Sara     90       85         88       87     87.5
Zia      78       75         79       80     78.0
Aisha    92       95         91       90     92.0
Bilal    88       89         87       86     87.5
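
One thing to watch for: if the cell above is run a second time, the existing `Average` column is itself included in the row-wise mean. A sketch of one way to avoid that is to name the subject columns explicitly; the `subjects` list is a helper introduced here for illustration.

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90, 78, 92, 88],
     'Physics': [80, 85, 75, 95, 89],
     'Chemistry': [82, 88, 79, 91, 87],
     'Biology': [85, 87, 80, 90, 86]},
    index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'],
)

# Helper list (an assumption of this sketch): the columns that count as subjects.
subjects = ['Math', 'Physics', 'Chemistry', 'Biology']

# Running this line repeatedly gives the same result, because the
# Average column is never part of the selection.
students['Average'] = students[subjects].mean(axis=1)
students['Average'] = students[subjects].mean(axis=1)

print(students['Average'])
```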


In [93]:
#sort students by math marks
print(students.sort_values(by='Math' , ascending=False))

       Math  Physics  Chemistry  Biology  Average
Aisha    92       95         91       90     92.0
Sara     90       85         88       87     87.5
Bilal    88       89         87       86     87.5
Ali      85       80         82       85     83.0
Zia      78       75         79       80     78.0
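
`sort_values` also accepts a list of columns, which helps when there are ties. In the sketch below Sara and Bilal share the same average, so `Math` is used as a second sort key; the choice of keys and directions is only an illustration.

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90, 78, 92, 88],
     'Average': [83.0, 87.5, 78.0, 92.0, 87.5]},
    index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'],
)

# Highest average first; among equal averages, lowest Math mark first.
ranked = students.sort_values(by=['Average', 'Math'], ascending=[False, True])
print(ranked)
```

Like the single-column call, this returns a new DataFrame and leaves `students` unchanged.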


**Basic Functionality**

In [94]:
#View first 3 rows 
print(students.head(3))

      Math  Physics  Chemistry  Biology  Average
Ali     85       80         82       85     83.0
Sara    90       85         88       87     87.5
Zia     78       75         79       80     78.0


In [95]:
#view last 2 rows 
print(students.tail(2))

       Math  Physics  Chemistry  Biology  Average
Aisha    92       95         91       90     92.0
Bilal    88       89         87       86     87.5


In [96]:
#Describe statistical summary 
print(students.describe())

            Math    Physics  Chemistry    Biology    Average
count   5.000000   5.000000   5.000000   5.000000   5.000000
mean   86.600000  84.800000  85.400000  85.600000  85.600000
std     5.458938   7.758866   4.827007   3.646917   5.308013
min    78.000000  75.000000  79.000000  80.000000  78.000000
25%    85.000000  80.000000  82.000000  85.000000  83.000000
50%    88.000000  85.000000  87.000000  86.000000  87.500000
75%    90.000000  89.000000  88.000000  87.000000  87.500000
max    92.000000  95.000000  91.000000  90.000000  92.000000


In [97]:
#check info about DataFrame
print(students.info())

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Ali to Bilal
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Math       5 non-null      int64  
 1   Physics    5 non-null      int64  
 2   Chemistry  5 non-null      int64  
 3   Biology    5 non-null      int64  
 4   Average    5 non-null      float64
dtypes: float64(1), int64(4)
memory usage: 412.0+ bytes
None


In [98]:
#check columns name 
print(students.columns)

Index(['Math', 'Physics', 'Chemistry', 'Biology', 'Average'], dtype='object')


**Reindexing in Pandas 🔄**

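Reindexing **conforms a DataFrame to a new set of row or column labels.** It can reorder existing labels and introduce new ones; labels that were not in the original are filled with `NaN` unless a `fill_value` is given.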

In [99]:
print(students)

       Math  Physics  Chemistry  Biology  Average
Ali      85       80         82       85     83.0
Sara     90       85         88       87     87.5
Zia      78       75         79       80     78.0
Aisha    92       95         91       90     92.0
Bilal    88       89         87       86     87.5


In [100]:
new_index = ['Sara', 'Ali', 'Bilal', 'Zia', 'Aisha']
students_reindexed = students.reindex(new_index)
print(students_reindexed)

       Math  Physics  Chemistry  Biology  Average
Sara     90       85         88       87     87.5
Ali      85       80         82       85     83.0
Bilal    88       89         87       86     87.5
Zia      78       75         79       80     78.0
Aisha    92       95         91       90     92.0


In [101]:
# Add a new row with reindexing
new_index = ['Sara', 'Ali', 'Bilal', 'Zia', 'Aisha', 'John']
students_reindexed = students.reindex(new_index, fill_value=0)
print(students_reindexed)

       Math  Physics  Chemistry  Biology  Average
Sara     90       85         88       87     87.5
Ali      85       80         82       85     83.0
Bilal    88       89         87       86     87.5
Zia      78       75         79       80     78.0
Aisha    92       95         91       90     92.0
John      0        0          0        0      0.0
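
For comparison, here is a sketch of what happens without `fill_value`, and of reindexing columns instead of rows. A smaller two-column table is used to keep the output short.

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90], 'Physics': [80, 85]},
    index=['Ali', 'Sara'],
)

# Without fill_value, the new label 'John' is filled with NaN,
# and the integer columns become float64 to hold the missing values.
with_john = students.reindex(['Ali', 'Sara', 'John'])

# The columns keyword reindexes along the column axis instead.
reordered = students.reindex(columns=['Physics', 'Math'])

print(with_john)
print(with_john.dtypes)
print(reordered)
```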


In [102]:
# Example: a DataFrame whose dtypes, shape, ndim, and size are inspected next
import numpy as np 

#create a DataFrame with random numbers 
df = pd.DataFrame(np.random.randn(3,4) ,columns = list('ABCD'))
print("DataFrame : ")
print(df)

DataFrame : 
          A         B         C         D
0 -0.296333  0.109183 -0.709211 -1.040242
1  2.232295  0.147242  0.127425 -0.199208
2  1.208569  2.386243  0.085750  0.298032


In [103]:
print("Results:")
print("Datatype : ",df.dtypes)  # data type of each column
print(df.shape)  # (rows, columns)
print(df.ndim)   # number of dimensions
print(df.size)   # total number of elements

Results:
Datatype :  A    float64
B    float64
C    float64
D    float64
dtype: object
(3, 4)
2
12


In [104]:
df =pd.DataFrame(
    {
        "A":[1,2,3],
        'B':[4,5,6]
    }
)
print(df)

# Replace 1 in column A and 6 in column B with 100
df.replace({"A":1,'B':6},100 ,inplace=True)
print("---------------------")
print(df)


   A  B
0  1  4
1  2  5
2  3  6
---------------------
     A    B
0  100    4
1    2    5
2    3  100
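
The dictionary form matters when the same value appears in several columns. The sketch below uses a slightly different table (column `B` also contains a 1, which is an assumption made for this example) and omits `inplace=True`, so each call returns a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 5, 6]})

# Scalar form: every 1 in the whole table is replaced.
everywhere = df.replace(1, 100)

# Dict form: only the 1 in column A is replaced; column B is untouched.
only_a = df.replace({"A": 1}, 100)

print(everywhere)
print(only_a)
print(df)  # the original is unchanged because inplace was not used
```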


**Deleting Columns**

In [105]:
df = pd.DataFrame({
    "A"  : [1,2,3],
    "B"  : [4,5,6],
    "C"  : [7,8,9]
})

print(df)
print('-----------------')
print("Drop columns A AND B")
print("Data Frame is ")
df = df.drop(columns=['A','B'])
print(df)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
-----------------
Drop columns A AND B
Data Frame is 
   C
0  7
1  8
2  9


**Deleting Rows**

In [106]:
df = pd.DataFrame({
    "A"  : [1,2,3],
    "B"  : [4,5,6],
    "C"  : [7,8,9]
},index = ["r1","r2","r3"])
print(df)
print("------------")
print("Drop r1")
#drop row with label 
df = df.drop('r1')
print(df)

    A  B  C
r1  1  4  7
r2  2  5  8
r3  3  6  9
------------
Drop r1
    A  B  C
r2  2  5  8
r3  3  6  9
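
`drop` has a few more options that are handy in practice. The sketch below drops several rows at once, ignores a label that does not exist (`'r9'` is a made-up label), and removes a row and a column in a single call.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9]
}, index=["r1", "r2", "r3"])

# Drop several rows at once by passing a list of labels.
only_r2 = df.drop(index=["r1", "r3"])

# 'r9' does not exist; errors='ignore' skips it instead of raising KeyError.
unchanged = df.drop(index="r9", errors="ignore")

# Rows and columns can be dropped in the same call.
smaller = df.drop(index="r1", columns="A")

print(only_r2)
print(unchanged)
print(smaller)
```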


### **Iteration over DataFrame 🔄**

You can loop through rows, columns, or items.

**Iterate over rows**

In [107]:
data = {
    'Math': [85, 90, 78, 92, 88],
    'Physics': [80, 85, 75, 95, 89],
    'Chemistry': [82, 88, 79, 91, 87]
}

students = pd.DataFrame(data,index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'])
print(students)

       Math  Physics  Chemistry
Ali      85       80         82
Sara     90       85         88
Zia      78       75         79
Aisha    92       95         91
Bilal    88       89         87


In [108]:
for index , row in students.iterrows():
    print(f"{index} scored {row['Math']} in Math and {row['Physics']} in physics")

Ali scored 85 in Math and 80 in physics
Sara scored 90 in Math and 85 in physics
Zia scored 78 in Math and 75 in physics
Aisha scored 92 in Math and 95 in physics
Bilal scored 88 in Math and 89 in physics
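
`iterrows()` is easy to read but builds a Series for every row, so it tends to be slow on large tables. Two common alternatives are sketched below: `itertuples()`, which yields lightweight named tuples, and plain column arithmetic, which avoids the loop altogether. The `total` Series is introduced here only for illustration.

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90, 78, 92, 88],
     'Physics': [80, 85, 75, 95, 89],
     'Chemistry': [82, 88, 79, 91, 87]},
    index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'],
)

# itertuples(): each row is a named tuple; the index label is in row.Index.
lines = []
for row in students.itertuples():
    lines.append(f"{row.Index} scored {row.Math} in Math")
print("\n".join(lines))

# Vectorized alternative: add whole columns without writing a loop.
total = students['Math'] + students['Physics']
print(total)
```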


**Iterate over Columns**

In [111]:
for col_name, col_data in students.items():
    print(f"Column: {col_name}")
    print(col_data)
    print("---")



Column: Math
Ali      85
Sara     90
Zia      78
Aisha    92
Bilal    88
Name: Math, dtype: int64
---
Column: Physics
Ali      80
Sara     85
Zia      75
Aisha    95
Bilal    89
Name: Physics, dtype: int64
---
Column: Chemistry
Ali      82
Sara     88
Zia      79
Aisha    91
Bilal    87
Name: Chemistry, dtype: int64
---


**Sorting Data by Index**

In [113]:
# Sort rows alphabetically by student names
print(students.sort_index())

       Math  Physics  Chemistry
Aisha    92       95         91
Ali      85       80         82
Bilal    88       89         87
Sara     90       85         88
Zia      78       75         79
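
`sort_index` takes the same `ascending` flag as `sort_values`, and `axis=1` sorts the column labels instead of the row labels. A short sketch on the same table:

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90, 78, 92, 88],
     'Physics': [80, 85, 75, 95, 89],
     'Chemistry': [82, 88, 79, 91, 87]},
    index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'],
)

# Reverse alphabetical order of the row labels.
z_to_a = students.sort_index(ascending=False)

# axis=1 sorts the column labels alphabetically.
by_column_name = students.sort_index(axis=1)

print(z_to_a)
print(by_column_name)
```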


**Selecting Data**

In [115]:
print(students)
print(students.iat[1,0])  # row position 1 (Sara), column position 0 (Math)

       Math  Physics  Chemistry
Ali      85       80         82
Sara     90       85         88
Zia      78       75         79
Aisha    92       95         91
Bilal    88       89         87
90


In [116]:
# Ali's Physics score, looked up by row label and column label
print(students.at['Ali','Physics'])

80
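
`at` and `iat` are the single-cell counterparts of `loc` and `iloc`: `at` takes labels and `iat` takes integer positions. The sketch below reads the same cell four ways and then uses `at` to update one value; the new mark of 80 is made up for the example.

```python
import pandas as pd

students = pd.DataFrame(
    {'Math': [85, 90, 78, 92, 88],
     'Physics': [80, 85, 75, 95, 89],
     'Chemistry': [82, 88, 79, 91, 87]},
    index=['Ali', 'Sara', 'Zia', 'Aisha', 'Bilal'],
)

# Four ways to read Sara's Math mark.
cell_at = students.at['Sara', 'Math']      # labels, single cell
cell_iat = students.iat[1, 0]              # positions, single cell
cell_loc = students.loc['Sara', 'Math']    # labels, general purpose
cell_iloc = students.iloc[1, 0]            # positions, general purpose
print(cell_at, cell_iat, cell_loc, cell_iloc)

# at can also assign to a single cell (hypothetical corrected mark).
students.at['Zia', 'Math'] = 80
print(students.at['Zia', 'Math'])
```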
