# Introduction to Pandas DataFrame 

Imagine you're in a classroom. The Pandas DataFrame is like a class register (attendance book). In that register:

- The rows are like students in your class.
- The columns are like the subjects each student has. For example: one column for "Name", another for "Math marks", one for "Science marks", etc.
- The index is like the roll numbers assigned to each student.

So, a Pandas DataFrame is simply a table that helps you organize your data in rows and columns, much like the way you maintain school or college records.

### What is a DataFrame?

A DataFrame is basically a 2D (two-dimensional) data structure in Pandas, meaning it has both rows and columns. It's one of the most important parts of Pandas, which is why we focus on it so much.

- Just like your class register has names and marks of students, a DataFrame holds labeled data.
- You can store different types of data in one DataFrame: numbers, strings (like names), dates, etc.
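
To make the "labeled data with mixed types" idea concrete, here is a minimal sketch. The register contents, the column names, and the exam dates are made up for illustration; they are not taken from a real dataset.

```python
import pandas as pd

# Hypothetical register: names (strings), marks (integers), exam dates (datetimes)
register = pd.DataFrame({
    'Name': ['Aman', 'Priya', 'Raj'],
    'Math marks': [85, 90, 78],
    'Exam date': pd.to_datetime(['2024-03-01', '2024-03-01', '2024-03-02']),
})

print(register.shape)          # (number of rows, number of columns)
print(register.dtypes)         # one data type per column
print(list(register.columns))  # the column labels
print(list(register.index))    # the row labels (0, 1, 2 by default)
```

Each column keeps its own data type, which is what lets one table hold names, numbers, and dates side by side.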

## Creating a Simple DataFrame:

Now, let's say you want to create a DataFrame for a class with details of 4 students: their names, their ages, and the cities they are from. In Pandas, you can do this using Python, and it's as simple as writing the names and values in the code.

To create a simple DataFrame in Pandas, you can use various data structures like lists, dictionaries, or even external data sources such as CSV files or databases. Let's start with the basics of DataFrame creation.

## 1. Creating a DataFrame from a Dictionary
We are familiar with storing data in tabular forms like a school attendance register. Think of a dictionary as a record where each key is a "column heading" and each value is a list of "entries" in that column.

#### Example: Student details in a classroom

In [187]:
import pandas as pd

# Data as a dictionary
student_data = {
    'Name': ['Aman', 'Priya', 'Raj', 'Sneha'],
    'Age': [16, 17, 16, 18],
    'City': ['Delhi', 'Mumbai', 'Bangalore', 'Chennai']
}

# Convert dictionary to DataFrame
df = pd.DataFrame(student_data)
print(df)

    Name  Age       City
0   Aman   16      Delhi
1  Priya   17     Mumbai
2    Raj   16  Bangalore
3  Sneha   18    Chennai


Name, Age, and City are the headings of our columns, just like the headings in a ledger.

The values are the actual data, similar to entries in each student's row in your school register.
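
Earlier, the index was compared to roll numbers. As a small sketch of that idea, you can pass your own row labels through the `index` argument. The roll numbers below are hypothetical values chosen only for this example.

```python
import pandas as pd

student_data = {
    'Name': ['Aman', 'Priya', 'Raj', 'Sneha'],
    'Age': [16, 17, 16, 18],
}

# Hypothetical roll numbers used as row labels instead of 0, 1, 2, 3
roll_numbers = [101, 102, 103, 104]
df_roll = pd.DataFrame(student_data, index=roll_numbers)
print(df_roll)

# Look up a student by roll number (row label) and column name
print(df_roll.loc[102, 'Name'])
```

If you leave out `index`, Pandas numbers the rows from 0, as in the example above.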

## 2. Creating a DataFrame from a List of Lists

Think of a list of lists as a table with rows of data. It's like when you jot down data in rows in your notebook for each student.

In [188]:
# List of lists where each list is a row (student's data)
data = [
    ['Aman', 16, 'Delhi'],
    ['Priya', 17, 'Mumbai'],
    ['Raj', 16, 'Bangalore'],
    ['Sneha', 18, 'Chennai']
]

# Create DataFrame and specify column labels
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

    Name  Age       City
0   Aman   16      Delhi
1  Priya   17     Mumbai
2    Raj   16  Bangalore
3  Sneha   18    Chennai


Each inner list represents a row, which is like a complete entry for a student in the class. It’s just like making rows in a notebook for student records.

## 3. Creating a DataFrame from a List of Dictionaries

In [189]:
# List of dictionaries where each dictionary is a row
family_data = [
    {'Name': 'Aman', 'Age': 16, 'City': 'Delhi'},
    {'Name': 'Priya', 'Age': 17, 'City': 'Mumbai'},
    {'Name': 'Raj', 'Age': 16, 'City': 'Bangalore'},
    {'Name': 'Sneha', 'Age': 18, 'City': 'Chennai'}
]

# Convert list of dictionaries to DataFrame
df = pd.DataFrame(family_data)
print(df)

    Name  Age       City
0   Aman   16      Delhi
1  Priya   17     Mumbai
2    Raj   16  Bangalore
3  Sneha   18    Chennai


Each dictionary can be seen as a person’s record, much like writing an entry for each person in a family or an office group.
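
One thing worth knowing about this approach: the dictionaries do not all need the same keys. The sketch below uses made-up records in which one entry has no 'City' key; Pandas fills the gap with a missing value (NaN).

```python
import pandas as pd

records = [
    {'Name': 'Aman', 'Age': 16, 'City': 'Delhi'},
    {'Name': 'Priya', 'Age': 17},  # no 'City' key in this record
]

df_records = pd.DataFrame(records)
print(df_records)

# isna() marks which entries are missing
print(df_records['City'].isna().tolist())
```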

## 4. Creating a DataFrame from NumPy Arrays
In India, we often deal with numbers, such as exam scores or bank transactions. Using NumPy arrays, we can store numerical data efficiently. Let’s create a DataFrame from NumPy arrays, much like a marksheet of students.

In [190]:
import numpy as np

# Create NumPy arrays for data
names = np.array(['Aman', 'Priya', 'Raj', 'Sneha'])
marks = np.array([85, 90, 78, 88])
cities = np.array(['Delhi', 'Mumbai', 'Bangalore', 'Chennai'])

# Create DataFrame from NumPy arrays
df = pd.DataFrame({
    'Name': names,
    'Marks': marks,
    'City': cities
})
print(df)

    Name  Marks       City
0   Aman     85      Delhi
1  Priya     90     Mumbai
2    Raj     78  Bangalore
3  Sneha     88    Chennai


Here, NumPy arrays are used to hold data, just like a marksheet with names and scores. In this case, each array holds a column of data.

## 5. Creating DataFrame from CSV or Excel files
In real life, we often store data in Excel or CSV files, like monthly electricity bills or bank statements. Pandas makes it very easy to read data from these files.

In [None]:
# Reading from a CSV file (file needs to be present in your system)
df = pd.read_csv('students_data.csv')
print(df)

Here, Pandas will load the CSV file data into a DataFrame, just like copying the table from Excel into Python.
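
If you do not have a CSV file at hand, you can still try `read_csv` by feeding it text held in memory. This is only a sketch: the CSV content below is invented and stands in for a real file such as students_data.csv. For Excel files, `pd.read_excel` plays the same role, but it needs an extra engine package (for example openpyxl) to be installed, so it is not run here.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Name,Age,City
Aman,16,Delhi
Priya,17,Mumbai
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Writing back out as CSV text; index=False leaves out the row labels
print(df_csv.to_csv(index=False))
```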

## 1. Accessing a Single Column (Like looking at a specific category in your notebook)
When you want to check a specific column, such as names, ages, or marks, it’s like checking only one subject or column in your register.

Example: Student Data

In [234]:
import pandas as pd

data = {
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Age': [18, 17, 19, 18],
    'Marks': [85, 92, 75, 88]
}

df = pd.DataFrame(data)

In [235]:
print(df)

     Name  Age  Marks
0   Rahul   18     85
1  Anjali   17     92
2  Sameer   19     75
3    Neha   18     88


Accessing a column by name:

In [236]:
# Access the 'Name' column
print(df['Name'])

0     Rahul
1    Anjali
2    Sameer
3      Neha
Name: Name, dtype: object


Here, you’re just looking at the "Name" column, like how you’d open your school register and look at all students’ names. You’re focusing on that one specific column.

## 2. Accessing Multiple Columns (Like checking multiple subjects at once)
You can access more than one column at the same time, similar to looking at both names and marks in your register.

In [237]:
# Accessing multiple columns ('Name' and 'Marks')
print(df[['Name', 'Marks']])

     Name  Marks
0   Rahul     85
1  Anjali     92
2  Sameer     75
3    Neha     88


Here, you're pulling out both the Name and Marks columns—like checking both subjects in your register at the same time. You’re essentially selecting two columns from the DataFrame.
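
The difference between single and double brackets is easy to miss, so here is a short sketch of it. The name `df_demo` is just a fresh copy of the student data so that the example runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Marks': [85, 92, 75, 88],
})

marks = df_demo['Marks']          # single brackets: a Series (one column)
marks_table = df_demo[['Marks']]  # double brackets: a one-column DataFrame

print(type(marks).__name__, type(marks_table).__name__)
print(marks.tolist())  # the column as a plain Python list
print(marks.max())     # a Series supports summary methods directly
```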

## 3. Accessing Rows by Index (Looking at a specific student's details)
If you want to see all details of one particular student (like Rahul), you would find his row in the register.

Accessing a row by index:


In [238]:
# Accessing the first row (Rahul's data)
print(df.loc[0])

Name     Rahul
Age         18
Marks       85
Name: 0, dtype: object


This is like saying, "Tell me all about the first student in the register." You’re getting the entire row—Name, Age, and Marks—for that student.

## 4. Accessing Rows by Label with .loc[]
Pandas has the .loc[] indexer, which lets you access data by its label (row index). Think of it like referring to a page number or entry number in your notebook.

Accessing multiple rows:

In [239]:
# Access the rows of Rahul and Anjali
print(df.loc[[0, 1]])

     Name  Age  Marks
0   Rahul   18     85
1  Anjali   17     92


You’re asking for all the details of Rahul and Anjali, just like reading two lines from your school register that contain all details for those students.
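
Labels do not have to be numbers. As a sketch, assuming you would rather look students up by name, `set_index` can turn the Name column into the row labels. The variable names `df_demo` and `by_name` are introduced here only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Age': [18, 17, 19, 18],
    'Marks': [85, 92, 75, 88],
})

by_name = df_demo.set_index('Name')  # names become the row labels

print(by_name.loc['Anjali'])                    # one student's full row
print(by_name.loc[['Rahul', 'Neha'], 'Marks'])  # two students, one column
```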

## 5. Accessing a Specific Value (Looking at one piece of information)
Let’s say you want to know Anjali’s Marks. You’ll need to go to Anjali's row and look under the Marks column, just like in your register.

Accessing a specific value:

In [240]:
# Access Anjali's Marks
print(df.loc[1, 'Marks'])

92


You’re looking for the Marks of the student at index 1 (Anjali). Just like finding Anjali in the school record and checking her marks.

## 6. Slicing (Accessing a range of rows and columns)
Sometimes, you might want to look at multiple students and focus on particular subjects, like how you’d look at rows 2 to 4 and only focus on Name and Marks.

Example: Access a slice of the DataFrame

In [241]:
# Get a slice of the DataFrame (2nd to 4th rows, and 'Name' & 'Marks' columns)
print(df.loc[1:3, ['Name', 'Marks']])

     Name  Marks
1  Anjali     92
2  Sameer     75
3    Neha     88


This is like reading Anjali, Sameer, and Neha's names and marks only, leaving out the other details. Note that, unlike ordinary Python slices, a .loc[] slice includes its end label, so 1:3 returns the rows labeled 1, 2, and 3.

## 7. Accessing by Position Using .iloc[] (Like accessing by row number)
Sometimes you don't know the labels, but you know the positions. For instance, you just want to check the 3rd student's marks without caring about their name. .iloc[] helps you access rows and columns based on their position, counting from 0 (so the first row is position 0, the second is position 1, and so on).

Example: Access row and column by position

In [242]:
# Access the value in the third row and third column (Sameer's marks)
print(df.iloc[2, 2])

75


You are saying, "Show me the value in the 3rd row (Sameer) and the 3rd column (Marks)." Because positions start at 0, both are written as 2. It's like going to row 3 of your register and looking at a specific column.

## 8. Accessing Multiple Rows and Columns with .iloc[]
You can also use .iloc[] to get multiple rows and columns at the same time, much like checking several lines in your notebook.

Example: Access multiple rows and columns

In [243]:
# Access rows at positions 1 to 3 and columns at positions 0 to 1
print(df.iloc[1:4, 0:2])

     Name  Age
1  Anjali   17
2  Sameer   19
3    Neha   18


You’re asking for the Name and Age of students from the 2nd to 4th rows, just like selecting a portion of your register.
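
A common source of confusion is how the two indexers treat the end of a slice. The sketch below, using a fresh copy of the student data called `df_demo`, runs the same `1:3` slice through both.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Age': [18, 17, 19, 18],
    'Marks': [85, 92, 75, 88],
})

# .loc slices by label and includes the end label: rows 1, 2, 3
by_label = df_demo.loc[1:3, ['Name', 'Marks']]

# .iloc slices by position and excludes the end position: rows 1, 2
by_position = df_demo.iloc[1:3, [0, 2]]

print(len(by_label), len(by_position))
print(by_position)
```

With the default 0, 1, 2, ... index, labels and positions look identical, which is why the difference only shows up at the end of the slice.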

## 9. Conditional Access (Filtering data based on conditions)
If your teacher asks, "Show me students who scored more than 80 marks," you'd manually go through the register and note down the names of students meeting that condition. In Pandas, we can easily filter data based on conditions.

Example: Access students who scored more than 80 marks

In [244]:
# Students with marks more than 80
print(df[df['Marks'] > 80])

     Name  Age  Marks
0   Rahul   18     85
1  Anjali   17     92
3    Neha   18     88


You are keeping only those students who have scored more than 80 marks, similar to how you'd pick out students manually in your notebook based on their marks.
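
It can help to look at the condition on its own. In this sketch, `mask` is simply a name chosen here for the True/False column that the comparison produces; the DataFrame then keeps the rows where the mask is True.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Marks': [85, 92, 75, 88],
})

mask = df_demo['Marks'] > 80  # one True/False value per row
print(mask.tolist())

# Keep the rows where the mask is True, and only the Name column
high_scorers = df_demo.loc[mask, 'Name'].tolist()
print(high_scorers)
```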

## 10. Accessing with Conditions on Multiple Columns
Let's say you want to check who is older than 18 and scored more than 60 marks. Just like flipping through pages in your register, you can use multiple conditions to filter data.

Example: Access based on multiple conditions

In [245]:
# Students older than 18 and marks more than 60
result = df[(df['Age'] > 18) & (df['Marks'] > 60)]
print(result)

     Name  Age  Marks
2  Sameer   19     75
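
Besides `&` (and), conditions can be combined with `|` (or) and flipped with `~` (not). The sketch below uses a fresh copy of the data named `df_demo`. Each condition needs its own parentheses, because these operators bind more tightly than comparisons like `>`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Age': [18, 17, 19, 18],
    'Marks': [85, 92, 75, 88],
})

# Older than 18 OR marks above 90
either = df_demo[(df_demo['Age'] > 18) | (df_demo['Marks'] > 90)]
print(either)

# NOT (marks above 80), i.e. marks of 80 or less
not_high = df_demo[~(df_demo['Marks'] > 80)]
print(not_high)
```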


Now that you've learned how to access data in a Pandas DataFrame and filter it with conditions on one or more columns, the next steps involve more advanced data manipulation and analysis techniques.

## 1. Chaining Conditions
Chaining conditions allows you to create more complex queries to filter your DataFrame based on multiple criteria.

Example: Students Older Than 18 with Marks Greater Than 70

In [246]:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Age': [18, 17, 19, 18],
    'Marks': [85, 92, 75, 88],
    'City': ['Delhi', 'Mumbai', 'Bangalore', 'Chennai']
}

df = pd.DataFrame(data)
print(df)

     Name  Age  Marks       City
0   Rahul   18     85      Delhi
1  Anjali   17     92     Mumbai
2  Sameer   19     75  Bangalore
3    Neha   18     88    Chennai


In [247]:
# Chaining conditions
filtered_students = df[(df['Age'] > 18) & (df['Marks'] > 70)]
print(filtered_students)

     Name  Age  Marks       City
2  Sameer   19     75  Bangalore


## 2. Using isin() for Filtering
The isin() function checks if each element in a DataFrame column is contained in a provided list. This is useful for filtering rows based on multiple possible values.

Example: Filter Students from Specific Cities

In [248]:
# Filtering students from Delhi and Mumbai
filtered_cities = df[df['City'].isin(['Delhi', 'Mumbai'])]
print(filtered_cities)

     Name  Age  Marks    City
0   Rahul   18     85   Delhi
1  Anjali   17     92  Mumbai


## 3. Using query() Method
The query() method allows you to filter a DataFrame using a string expression. It provides a cleaner and more readable syntax, especially for complex conditions.

Example: Query for Students Older Than 18 with Marks Greater Than 70

In [249]:
# Using query to filter
result = df.query('Age > 18 and Marks > 70')
print(result)

     Name  Age  Marks       City
2  Sameer   19     75  Bangalore
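
Inside a query string you can also refer to an ordinary Python variable by putting `@` in front of its name. In this sketch, `min_marks` is a hypothetical threshold chosen for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Age': [18, 17, 19, 18],
    'Marks': [85, 92, 75, 88],
})

min_marks = 80  # hypothetical threshold

# @min_marks refers to the Python variable defined above
adults_above = df_demo.query('Marks > @min_marks and Age >= 18')
print(adults_above)
```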


## 4. Sorting Data
Sorting helps to arrange your DataFrame based on specific column values. This is useful for identifying top or bottom performers in your dataset.

Example: Sort by Marks in Descending Order

In [250]:
# Sorting by Marks
sorted_df = df.sort_values(by='Marks', ascending=False)
print(sorted_df)

     Name  Age  Marks       City
1  Anjali   17     92     Mumbai
3    Neha   18     88    Chennai
0   Rahul   18     85      Delhi
2  Sameer   19     75  Bangalore


You can also sort by multiple columns:

In [251]:
# Sort by Age, then by Marks
sorted_df_multi = df.sort_values(by=['Age', 'Marks'], ascending=[True, False])
print(sorted_df_multi)

     Name  Age  Marks       City
1  Anjali   17     92     Mumbai
3    Neha   18     88    Chennai
0   Rahul   18     85      Delhi
2  Sameer   19     75  Bangalore
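
If you only need the top few rows, `nlargest` is a shortcut for sorting and then taking the first entries. The sketch below also shows `reset_index(drop=True)`, which renumbers the rows after sorting. The DataFrame `df_demo` is a fresh copy used only here.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Marks': [85, 92, 75, 88],
})

# The two highest-scoring students
top_two = df_demo.nlargest(2, 'Marks')
print(top_two)

# Sort, then renumber the rows 0, 1, 2, ... (the old index is dropped)
ranked = df_demo.sort_values(by='Marks', ascending=False).reset_index(drop=True)
print(ranked)
```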


## 5. Group By Operations
The groupby() function is powerful for aggregating data based on one or more columns. You can compute statistics like mean, sum, count, etc.

Example: Group By City and Calculate Average Marks

In [252]:
# Group by City and calculate average Marks
grouped_df = df.groupby('City')['Marks'].mean().reset_index()
print(grouped_df)

        City  Marks
0  Bangalore   75.0
1    Chennai   88.0
2      Delhi   85.0
3     Mumbai   92.0


The groupby(...).mean() call returns a Series indexed by City. The reset_index() method turns that index back into a regular column, which gives you a DataFrame again.
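
Grouping is more interesting when a group has more than one row. The sketch below uses made-up data in which two students share each city, and `agg` computes several statistics at once.

```python
import pandas as pd

# Hypothetical data: two students per city
df_city = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'Marks': [85, 92, 75, 88],
})

# Mean, maximum, and number of students per city
summary = df_city.groupby('City')['Marks'].agg(['mean', 'max', 'count'])
print(summary)
```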

## 6. Applying Functions with apply()
The apply() function lets you apply a custom function to each value in a column (a Series), or across the rows or columns of a whole DataFrame. This is especially useful for data transformations.

Example: Increase Marks by 5

In [253]:
# Define a function to increase marks
def increase_marks(marks):
    return marks + 5

# Apply the function to Marks column
df['Updated Marks'] = df['Marks'].apply(increase_marks)
print(df)

     Name  Age  Marks       City  Updated Marks
0   Rahul   18     85      Delhi             90
1  Anjali   17     92     Mumbai             97
2  Sameer   19     75  Bangalore             80
3    Neha   18     88    Chennai             93
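
For simple arithmetic like this, you do not strictly need `apply`: adding 5 to the whole column gives the same result and is usually faster. The sketch below shows that, and then uses `apply` with `axis=1`, where the function receives one row at a time. The column names 'Plus five' and 'Label' are invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Marks': [85, 92, 75, 88],
})

# Arithmetic on a whole column at once (no apply needed)
df_demo['Plus five'] = df_demo['Marks'] + 5

# The same thing written with apply, one value at a time
same = df_demo['Marks'].apply(lambda m: m + 5)

# axis=1 passes each row to the function, so several columns can be combined
df_demo['Label'] = df_demo.apply(
    lambda row: f"{row['Name']} ({row['Marks']})", axis=1
)
print(df_demo)
```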


## 7. Handling Missing Data
Missing data can skew your analysis, so handling it properly is crucial.

Example: Dropping Missing Values

In [254]:
# Assuming there are missing values in the DataFrame
df.loc[2, 'Marks'] = None  # Introduce a missing value for demonstration

# Drop rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)

     Name  Age  Marks     City  Updated Marks
0   Rahul   18   85.0    Delhi             90
1  Anjali   17   92.0   Mumbai             97
3    Neha   18   88.0  Chennai             93


Filling missing values:

In [255]:
# Calculate the mean and fill missing values without inplace
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())
print(df)

     Name  Age      Marks       City  Updated Marks
0   Rahul   18  85.000000      Delhi             90
1  Anjali   17  92.000000     Mumbai             97
2  Sameer   19  88.333333  Bangalore             80
3    Neha   18  88.000000    Chennai             93
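
Before dropping or filling anything, it is worth checking how much is actually missing. This sketch uses a small made-up DataFrame with one missing mark. It counts the gaps with `isna()`, fills with the median instead of the mean, and drops rows based on a single column with `subset`.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'Marks': [85.0, 92.0, np.nan, 88.0],  # Sameer's mark is missing
})

# Count missing values in the column
n_missing = df_missing['Marks'].isna().sum()
print(n_missing)

# Fill with the median of the known marks (85, 88, 92)
filled = df_missing['Marks'].fillna(df_missing['Marks'].median())
print(filled.tolist())

# Drop rows only when 'Marks' is missing
kept = df_missing.dropna(subset=['Marks'])
print(kept)
```

Whether the mean, the median, or dropping the row is the better choice depends on your data; none of them is right in every case.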


## 8. Merging DataFrames
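Merging is similar to SQL joins. It's useful when you have related datasets that you want to combine. In the example below, both DataFrames contain a City column, so Pandas keeps both copies and adds the suffixes _x (from the left DataFrame) and _y (from the right DataFrame) to tell them apart.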

Example: Merging Two DataFrames

In [256]:
# Create another DataFrame with City info
data2 = {
    'Name': ['Rahul', 'Anjali', 'Sameer', 'Neha'],
    'City': ['Delhi', 'Mumbai', 'Bangalore', 'Chennai']
}
df2 = pd.DataFrame(data2)

# Merge on Name
merged_df = pd.merge(df, df2, on='Name')
print(merged_df)

     Name  Age      Marks     City_x  Updated Marks     City_y
0   Rahul   18  85.000000      Delhi             90      Delhi
1  Anjali   17  92.000000     Mumbai             97     Mumbai
2  Sameer   19  88.333333  Bangalore             80  Bangalore
3    Neha   18  88.000000    Chennai             93    Chennai
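
Merges get more interesting when the two tables do not list exactly the same students. The sketch below uses two small made-up tables, `students` and `cities`, that share only the Name column, and compares the default inner join with a left join.

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Sameer'],
    'Marks': [85, 92, 75],
})
cities = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Neha'],
    'City': ['Delhi', 'Mumbai', 'Chennai'],
})

# Default how='inner': only names present in both tables
inner = pd.merge(students, cities, on='Name')
print(inner)

# how='left': every row of students is kept; a missing city becomes NaN
left = pd.merge(students, cities, on='Name', how='left')
print(left)
```

Since Name is the only column the two tables have in common, no _x or _y suffixes appear in the result.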


Having covered data access and handling missing values, let's look at a few data cleaning techniques.

#### Data Cleaning Techniques
Removing Duplicates: Use drop_duplicates() to remove duplicate rows.

In [257]:
# Remove duplicates
df_unique = df.drop_duplicates()
df_unique

     Name  Age      Marks       City  Updated Marks
0   Rahul   18  85.000000      Delhi             90
1  Anjali   17  92.000000     Mumbai             97
2  Sameer   19  88.333333  Bangalore             80
3    Neha   18  88.000000    Chennai             93
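
The student DataFrame has no repeated rows, so `drop_duplicates()` returns it unchanged. To see the method do something, here is a sketch with made-up data that does contain repeats.

```python
import pandas as pd

# Hypothetical data with repeated entries
df_dup = pd.DataFrame({
    'Name': ['Rahul', 'Anjali', 'Rahul', 'Rahul'],
    'Marks': [85, 92, 85, 70],
})

# duplicated() flags rows that repeat an earlier row exactly
print(df_dup.duplicated().tolist())

# Remove exact duplicate rows (the first occurrence is kept)
exact = df_dup.drop_duplicates()
print(exact)

# Treat rows as duplicates when 'Name' alone repeats
one_per_name = df_dup.drop_duplicates(subset='Name')
print(one_per_name)
```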


Renaming Columns: Use rename() to rename columns for clarity

In [258]:
# Rename columns: rename() returns a new DataFrame, so assign the result back to df
df = df.rename(columns={'Marks': 'Score'})
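
A detail worth keeping in mind with `rename()`: by default it returns a new DataFrame and leaves the original alone, while with `inplace=True` it changes the original and returns None. The sketch below shows both behaviours on a small copy of the data named `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Rahul', 'Anjali'],
    'Marks': [85, 92],
})

# Default: a new DataFrame comes back, the original is untouched
renamed = df_demo.rename(columns={'Marks': 'Score'})
print(list(df_demo.columns))
print(list(renamed.columns))

# inplace=True: the original is modified and the call returns None
returned = df_demo.rename(columns={'Marks': 'Score'}, inplace=True)
print(returned)
print(list(df_demo.columns))
```

So use one style or the other: either assign the result, or pass `inplace=True`, but not both together.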