# Introduction to Pandas
March 11, 2025

In [1]:
import numpy as np
import pandas as pd

#### Pandas

Pandas is a Python library used for data manipulation and analysis. Built on top of NumPy, it provides easy-to-use structures like:
* **Series:** 1D array (like a column)
* **DataFrame:** 2D table (like a spreadsheet)

**Series**

In [2]:
s = pd.Series([10, 20, 30, 40])
print(s)

0    10
1    20
2    30
3    40
dtype: int64


**Index:** The set of labels used for identifying rows in a Series or DataFrame.

In [3]:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

a    10
b    20
c    30
d    40
dtype: int64


Here the index is customized to use the labels ('a', 'b', 'c', 'd') instead of the default integers. These labels serve as row identifiers, making it easier to reference and manipulate specific entries.
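
To make the role of the labels concrete, here is a minimal sketch that looks up values in the same Series by label and by position. The variable names `by_label`, `by_position`, and `subset` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']        # label-based lookup using the custom index
by_position = s.iloc[1]      # position-based lookup, independent of the labels
subset = s.loc[['a', 'd']]   # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```

Both lookups return the same value here because label 'b' sits at position 1.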

**DataFrame**

In [4]:
# Create a dictionary of data
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']
       }
print(f"Dictionary: \n{data}\n")

Dictionary: 
{'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}



In [5]:
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
print("DataFrame with default index:\n")
print(df)


DataFrame with default index:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [6]:
# Set a custom index
df.index = ['a', 'b', 'c']
print("\nDataFrame with custom index:\n")
print(df)


DataFrame with custom index:

      Name  Age         City
a    Alice   25     New York
b      Bob   30  Los Angeles
c  Charlie   35      Chicago


In [7]:
print(type(data))

<class 'dict'>


In [8]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


#### **Other Formats to Create Pandas DataFrames**

* **From a list of lists:**

In [9]:
data_list = [
    ["Alice", 25, 50000],
    ["Bob", 30, 60000],
    ["Charlie", 35, 70000]
]
df_list = pd.DataFrame(data_list, columns=["Name", "Age", "Salary"])
print(df_list)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000


* **From a CSV file:**
```
df = pd.read_csv("data.csv")
```

* **From an Excel file:**
```
df = pd.read_excel("data.xlsx")
```

In [10]:
df = pd.read_csv("data.csv")

In [11]:
print(df)

            Name  Gender  Physics  Chemistry  Maths  English
0  Katherine      Female      100         55    100       60
1    Neil           Male       60         55     50       60
2  Sushana        Female       85         88     99       80
3    Tejas          Male       65         75     79       70
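
The file `data.csv` itself is not shown in this notebook. To follow along without it, one option is to rebuild an equivalent table in memory. The sketch below assumes the file holds exactly the rows printed above; the `csv_text` string is a reconstruction, not the original file. The uneven alignment in the printed output suggests the real file may contain extra whitespace, which this sketch leaves out.

```python
import io
import pandas as pd

# Assumption: this text mirrors the printed output above, not the actual data.csv.
csv_text = """Name,Gender,Physics,Chemistry,Maths,English
Katherine,Female,100,55,100,60
Neil,Male,60,55,50,60
Sushana,Female,85,88,99,80
Tejas,Male,65,75,79,70
"""

# read_csv accepts a file-like object as well as a path
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.dtypes)
```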


#### **Data Understanding and Preparation**
##### **Checking Basic Information**
##### **Preview the data:**

In [None]:
print(df.head())  # First 5 rows

In [None]:
print(df.head(3))  # First 3 rows

In [None]:
print(df.tail())  # Last 5 rows

---
##### **Get dataset summary:**

In [None]:
df.info()  # Prints column names, non-null counts, and data types; it returns None, so no print() is needed

In [None]:
print(df.describe())  # Summary statistics for numerical columns

---
##### **Get column names:**

In [None]:
print(df.columns)

---
##### **Get number of rows and columns:**


In [None]:
print(df.shape)  # (rows, columns)

---
#### **Indexers in Pandas**

##### **Selecting Columns**
* **Using column names:**

In [None]:
print(df["Name"])  # Select a single column (returns a Series)

In [None]:
print(df[["Name", "Maths"]])  # Select multiple columns

##### **Selecting Rows**
* **Using index numbers:**

In [None]:
print(df.iloc[0])  # First row

In [None]:
print(df.iloc[0])  # First row
print("\n")
print(df.iloc[1:3])  # Rows at positions 1 and 2 (the end position is excluded)

* **Using labels:**

In [None]:
print(df.loc[0])  # Row labeled 0 (the first row under the default integer index)

In [None]:
df.loc[1:3]  # Label-based slice: rows labeled 1 through 3, end label included

In [None]:
df.iloc[1:3]  # Position-based slice: rows at positions 1 and 2, end position excluded
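
The two slices above look alike but return different numbers of rows. The following sketch shows the difference on a small stand-in frame; `df_demo` and its values are illustrative, chosen to resemble the dataset loaded earlier.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3
df_demo = pd.DataFrame({
    "Name": ["Katherine", "Neil", "Sushana", "Tejas"],
    "Maths": [100, 50, 99, 79],
})

by_label = df_demo.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = df_demo.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))
```

With the default integer index, labels and positions coincide, so the only visible difference is whether the end of the slice is included.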

---
#### **Handling Missing Values**

##### **Checking for Missing Values**

In [None]:
print(df.isnull())

In [None]:
print(df.isnull().sum()) # Count missing values in each column

##### **Filling Missing Values**
* Replace missing values with a specific number:

In [None]:
df.fillna(0, inplace=True)  # Replace all NaN values with 0

* Fill missing values with column mean (for numerical columns):

In [None]:
df["Physics"] = df["Physics"].fillna(df["Physics"].mean())  # Fill any remaining NaN with the column mean

In [None]:
df
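
The two filling strategies are alternatives: once `fillna(0)` has run in place, no NaN values remain for the mean fill to act on. The sketch below applies each strategy to its own copy of a small frame. The `scores` frame and its missing entries are made up for illustration, since the dataset printed earlier shows no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value per column
scores = pd.DataFrame({
    "Physics": [100.0, np.nan, 85.0, 65.0],
    "Maths": [100.0, 50.0, np.nan, 79.0],
})

missing_per_column = scores.isnull().sum()  # count of NaN in each column

# Strategy 1: replace every NaN with 0 (returns a new frame)
filled_zero = scores.fillna(0)

# Strategy 2: replace NaN in one column with that column's mean
filled_mean = scores.copy()
filled_mean["Physics"] = filled_mean["Physics"].fillna(filled_mean["Physics"].mean())

print(missing_per_column)
print(filled_zero)
print(filled_mean)
```

`mean()` skips NaN by default, so the fill value for Physics is computed from the three observed scores only.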

##### **Removing Missing Values**
* Remove rows with missing values:

In [None]:
df.dropna(inplace=True)

* Remove columns with missing values:

In [None]:
df.dropna(axis=1, inplace=True)
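
As a rough illustration of the two removal options, the sketch below drops rows and then columns from a small frame. The `scores` frame is hypothetical. The calls omit `inplace=True`, so each returns a new frame and the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing Physics score
scores = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Physics": [100.0, np.nan, 85.0],
    "Maths": [90.0, 50.0, 70.0],
})

rows_dropped = scores.dropna()        # removes the row that contains NaN
cols_dropped = scores.dropna(axis=1)  # removes the column that contains NaN

print(rows_dropped)
print(cols_dropped)
```

Dropping rows loses one student; dropping columns loses the entire Physics column because of a single missing entry, which is worth weighing before choosing `axis=1`.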

#### **Examining Numerical and Categorical Data**
##### **Numerical Data**

In [None]:
print(df.describe())  # Summary of numerical data

In [None]:
print("Mean of Physics score:", df["Physics"].mean())  # Mean
print("Median of Physics score:", df["Physics"].median())  # Median
print("Standard deviation of Physics score:", df["Physics"].std())  # Sample standard deviation
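
For reference, here is a small worked example using the four Physics scores printed earlier (100, 60, 85, 65). It assumes those values are unchanged by the missing-value steps. Note that `std()` computes the sample standard deviation (`ddof=1`) by default; passing `ddof=0` gives the population version.

```python
import pandas as pd

# Physics scores as printed in the dataset above
physics = pd.Series([100, 60, 85, 65], name="Physics")

mean_score = physics.mean()           # (100 + 60 + 85 + 65) / 4 = 77.5
median_score = physics.median()       # average of the two middle values, 65 and 85 = 75.0
sample_std = physics.std()            # divides by n - 1 (default ddof=1)
population_std = physics.std(ddof=0)  # divides by n

print(mean_score, median_score, sample_std, population_std)
```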

##### **Categorical Data**

In [None]:
# Get unique categories:
print(df["Name"].unique())

In [None]:
# Count occurrences of each category:
print(df["Name"].value_counts())

In [None]:
# Convert categorical data into numerical format (encoding):
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df

In [None]:
df
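
One caveat with `map` is that any value not present in the dictionary becomes NaN. The sketch below shows this with a hypothetical `gender` Series containing a stray trailing space, and one possible fix using `str.strip()`. The names `mapping`, `encoded_raw`, `encoded_clean`, and `encoded_twice` are illustrative.

```python
import pandas as pd

mapping = {"Male": 0, "Female": 1}

# Hypothetical column; note the trailing space in the second entry
gender = pd.Series(["Female", "Male ", "Female", "Male"])

encoded_raw = gender.map(mapping)                # "Male " is not a key, so it becomes NaN
encoded_clean = gender.str.strip().map(mapping)  # strip whitespace before mapping

# Mapping an already-encoded column again yields NaN everywhere,
# because 0 and 1 are not keys in the dictionary
encoded_twice = encoded_clean.map(mapping)

print(encoded_raw)
print(encoded_clean)
print(encoded_twice)
```

In a notebook, this means the encoding cell should be run only once per freshly loaded DataFrame.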