# **Pandas Basics for Data Analysis**

--- 

In [None]:
# Installing Pandas library

In [None]:
# Uncomment the line below if pandas is not installed
# !pip install pandas

In [1]:
# Importing the pandas library
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

---

## 1. **Creating and Importing Data in Pandas**

- To create a DataFrame, you can start with **dictionaries, lists, or NumPy arrays**. 
- This makes Pandas flexible in terms of how you bring in your data. 
- Often, you’ll import data from a file, such as a **CSV** using pd.read_csv(), or an **Excel** sheet with pd.read_excel(). 
- You can even connect to a SQL database using pd.read_sql(). 
- Once your data is in a DataFrame, use .head() to take a peek at the first few rows and .info() to see details like column names, data types, and missing values.
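
The creation routes listed above can be sketched as follows. This is a minimal illustration: the names, scores, and the in-memory CSV text are made up for the example, and `io.StringIO` stands in for a real file path.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary: keys become column names.
df_from_dict = pd.DataFrame({"name": ["Asha", "Ben"], "score": [82, 67]})

# From a list of rows: column names are passed separately.
df_from_list = pd.DataFrame([["Asha", 82], ["Ben", 67]], columns=["name", "score"])

# From a NumPy array: column names are again supplied explicitly.
df_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

# pd.read_csv() accepts a path or a file-like object; StringIO stands in for a file.
csv_text = "name,score\nAsha,82\nBen,\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv.head())  # first rows
df_from_csv.info()         # column names, dtypes, non-null counts
```

The empty score for the second row is read as a missing value, which is why .info() reports one fewer non-null entry for that column.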

In [3]:
# Loading a dataset from a CSV file
df = pd.read_csv('churn_prediction.csv')

# Displaying the first few rows of the dataframe
df.head()  # This shows the first 5 rows of the dataset

Unnamed: 0,customer_id,vintage,age,gender,dependents,occupation,city,customer_nw_category,branch_code,current_balance,...,average_monthly_balance_prevQ,average_monthly_balance_prevQ2,current_month_credit,previous_month_credit,current_month_debit,previous_month_debit,current_month_balance,previous_month_balance,last_transaction,churn
0,1,2401,66,Male,0.0,self_employed,187.0,2,755,1458.71,...,1458.71,1449.07,0.2,0.2,0.2,0.2,1458.71,1458.71,2019-05-21,0
1,2,2648,35,Male,0.0,self_employed,,2,3214,5390.37,...,7799.26,12419.41,0.56,0.56,5486.27,100.56,6496.78,8787.61,2019-11-01,0
2,4,2494,31,Male,0.0,salaried,146.0,2,41,3913.16,...,4910.17,2815.94,0.61,0.61,6046.73,259.23,5006.28,5070.14,NaT,0
3,5,2629,90,,,self_employed,1020.0,2,582,2291.91,...,2084.54,1006.54,0.47,0.47,0.47,2143.33,2291.91,1669.79,2019-08-06,1
4,6,1879,42,Male,2.0,self_employed,1494.0,3,388,927.72,...,1643.31,1871.12,0.33,714.61,588.62,1538.06,1157.15,1677.16,2019-11-03,1


In [5]:
df.shape

(15929, 21)

---

## 2. **Exploring DataFrames**

- Once you have your DataFrame, it's time to explore it.
- Use .describe() to get a quick statistical summary of the numeric columns—things like mean, min, and max values.
- The .shape attribute tells you the number of rows and columns, while .columns lists all the column names.
- These basic exploration techniques help you understand what you're dealing with before diving into deeper analysis.

In [9]:
import pandas as pd
import numpy as np
import random

In [11]:
random.seed(101)

# Define realistic names of Indian and British origins
names = ['Aarav Patel', 'Sophia Smith', 'Liam Johnson', 'Saanvi Gupta', 
         'Olivia Brown', 'Indiana Jones', 'Rohan Sharma', 'James Wilson']

# Define genders corresponding to names
genders = ['Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male']

# Generate random data
ages = [random.randint(21, 25) for _ in range(8)]
marks = [random.randint(60, 95) for _ in range(8)]
subjects = random.choices(['Python', 'Statistics', 'Generative AI'], k=8)

# Create DataFrame
df = pd.DataFrame({
    'Name': names,
    'Gender': genders,
    'Age': ages,
    'Marks': marks,
    'Subject': subjects
})

df

Unnamed: 0,Name,Gender,Age,Marks,Subject
0,Aarav Patel,Male,25,74,Python
1,Sophia Smith,Female,22,78,Python
2,Liam Johnson,Male,25,91,Generative AI
3,Saanvi Gupta,Female,23,73,Statistics
4,Olivia Brown,Female,24,81,Statistics
5,Indiana Jones,Male,21,88,Python
6,Rohan Sharma,Male,25,64,Statistics
7,James Wilson,Male,22,76,Statistics


In [17]:
# Summary statistics of the dataframe
df.describe().round(2)

Unnamed: 0,Age,Marks
count,8.0,8.0
mean,23.38,78.12
std,1.6,8.61
min,21.0,64.0
25%,22.0,73.75
50%,23.5,77.0
75%,25.0,82.75
max,25.0,91.0


In [19]:
df.describe(include="object")

Unnamed: 0,Name,Gender,Subject
count,8,8,8
unique,8,2,3
top,Aarav Patel,Male,Statistics
freq,1,5,4


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     8 non-null      object
 1   Gender   8 non-null      object
 2   Age      8 non-null      int64 
 3   Marks    8 non-null      int64 
 4   Subject  8 non-null      object
dtypes: int64(2), object(3)
memory usage: 452.0+ bytes


In [13]:
# Shape of the dataframe (rows, columns)
df.shape

(8, 5)

In [None]:
# Column names in the dataframe
df.columns

---

### 3. **Selecting and Filtering Data**

- Selecting and filtering data are key actions in Pandas.
- You can select a column by simply using df['column_name'], which returns a Series.
- To filter rows, use a condition, like selecting rows where the score is greater than 50 with df[df['score'] > 50].
- To access specific rows or columns, use .loc[] for label-based access or .iloc[] for position-based access.
- These methods are powerful tools for zeroing in on exactly the data you need.
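
As a small sketch of the selection patterns described above, the example below combines two filter conditions and uses .loc[] and .iloc[] with both rows and columns. The `students` table and its values are invented for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Asha", "Ben", "Chloe", "Dev"],
    "subject": ["Python", "Statistics", "Python", "Statistics"],
    "score": [45, 72, 88, 51],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
high_python = students[(students["score"] > 50) & (students["subject"] == "Python")]

# .loc takes a row condition and a list of column labels in one step.
names_over_50 = students.loc[students["score"] > 50, ["name", "score"]]

# .iloc uses integer positions: first two rows, first two columns.
top_left = students.iloc[:2, :2]

print(high_python)
print(names_over_50)
print(top_left)
```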

In [23]:
# Selecting the 'Marks' column
df['Marks']

0    74
1    78
2    91
3    73
4    81
5    88
6    64
7    76
Name: Marks, dtype: int64

In [25]:
type(df)

pandas.core.frame.DataFrame

In [27]:
type(df['Marks'])

pandas.core.series.Series

In [29]:
df[["Name", 'Marks']]

Unnamed: 0,Name,Marks
0,Aarav Patel,74
1,Sophia Smith,78
2,Liam Johnson,91
3,Saanvi Gupta,73
4,Olivia Brown,81
5,Indiana Jones,88
6,Rohan Sharma,64
7,James Wilson,76


In [31]:
df

Unnamed: 0,Name,Gender,Age,Marks,Subject
0,Aarav Patel,Male,25,74,Python
1,Sophia Smith,Female,22,78,Python
2,Liam Johnson,Male,25,91,Generative AI
3,Saanvi Gupta,Female,23,73,Statistics
4,Olivia Brown,Female,24,81,Statistics
5,Indiana Jones,Male,21,88,Python
6,Rohan Sharma,Male,25,64,Statistics
7,James Wilson,Male,22,76,Statistics


In [33]:
# Filtering rows where Marks > 80
df[df['Marks'] > 80]

Unnamed: 0,Name,Gender,Age,Marks,Subject
2,Liam Johnson,Male,25,91,Generative AI
4,Olivia Brown,Female,24,81,Statistics
5,Indiana Jones,Male,21,88,Python


In [35]:
df

Unnamed: 0,Name,Gender,Age,Marks,Subject
0,Aarav Patel,Male,25,74,Python
1,Sophia Smith,Female,22,78,Python
2,Liam Johnson,Male,25,91,Generative AI
3,Saanvi Gupta,Female,23,73,Statistics
4,Olivia Brown,Female,24,81,Statistics
5,Indiana Jones,Male,21,88,Python
6,Rohan Sharma,Male,25,64,Statistics
7,James Wilson,Male,22,76,Statistics


In [37]:
# Selecting a specific row by label using .loc
df.loc[1]  # Second row

Name       Sophia Smith
Gender           Female
Age                  22
Marks                78
Subject          Python
Name: 1, dtype: object

In [39]:
# Selecting a specific row by index using .iloc
# (.iloc counts rows by integer position, starting at 0, regardless of the index labels)
df.iloc[2]  # Third row

Name        Liam Johnson
Gender              Male
Age                   25
Marks                 91
Subject    Generative AI
Name: 2, dtype: object

In [41]:
df.iloc[:5]

Unnamed: 0,Name,Gender,Age,Marks,Subject
0,Aarav Patel,Male,25,74,Python
1,Sophia Smith,Female,22,78,Python
2,Liam Johnson,Male,25,91,Generative AI
3,Saanvi Gupta,Female,23,73,Statistics
4,Olivia Brown,Female,24,81,Statistics


In [43]:
df.iloc[:, :3]

Unnamed: 0,Name,Gender,Age
0,Aarav Patel,Male,25
1,Sophia Smith,Female,22
2,Liam Johnson,Male,25
3,Saanvi Gupta,Female,23
4,Olivia Brown,Female,24
5,Indiana Jones,Male,21
6,Rohan Sharma,Male,25
7,James Wilson,Male,22


In [45]:
df.iloc[-1] # last row

Name       James Wilson
Gender             Male
Age                  22
Marks                76
Subject      Statistics
Name: 7, dtype: object

In [47]:
df.iloc[:, -1]  # fetch the last column

0           Python
1           Python
2    Generative AI
3       Statistics
4       Statistics
5           Python
6       Statistics
7       Statistics
Name: Subject, dtype: object

---

### 4. **Adding and Modifying Columns**

- To add a new column in a DataFrame, you can simply define a new one by assigning a value or a calculation.
- For example, df['new_col'] = df['old_col'] * 2 adds a new column by doubling an existing one.
- If you need to modify a column, the .apply() method is useful—it allows you to apply a function to every value in the column.
- You can also rename columns using .rename(), which helps make your dataset more readable and organized.
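
The sketch below puts the three operations side by side on a made-up `prices` table. It also shows that, for simple arithmetic, operating on the whole column gives the same result as .apply() and is usually the faster choice.

```python
import pandas as pd

prices = pd.DataFrame({"item": ["pen", "book"], "old_col": [10, 250]})

# New column from a calculation on an existing one.
prices["new_col"] = prices["old_col"] * 2

# .apply() calls the function once per value...
with_apply = prices["old_col"].apply(lambda x: x + 5)
# ...while arithmetic on the whole Series gives the same result.
vectorized = prices["old_col"] + 5

# .rename() returns a new DataFrame; the original keeps its column names
# unless the result is assigned back.
renamed = prices.rename(columns={"old_col": "price"})

print(renamed)
```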

In [49]:
df

Unnamed: 0,Name,Gender,Age,Marks,Subject
0,Aarav Patel,Male,25,74,Python
1,Sophia Smith,Female,22,78,Python
2,Liam Johnson,Male,25,91,Generative AI
3,Saanvi Gupta,Female,23,73,Statistics
4,Olivia Brown,Female,24,81,Statistics
5,Indiana Jones,Male,21,88,Python
6,Rohan Sharma,Male,25,64,Statistics
7,James Wilson,Male,22,76,Statistics


In [51]:
# Adding a new column 'Double_Marks'
df['Double_Marks'] = df['Marks'] * 2
df

Unnamed: 0,Name,Gender,Age,Marks,Subject,Double_Marks
0,Aarav Patel,Male,25,74,Python,148
1,Sophia Smith,Female,22,78,Python,156
2,Liam Johnson,Male,25,91,Generative AI,182
3,Saanvi Gupta,Female,23,73,Statistics,146
4,Olivia Brown,Female,24,81,Statistics,162
5,Indiana Jones,Male,21,88,Python,176
6,Rohan Sharma,Male,25,64,Statistics,128
7,James Wilson,Male,22,76,Statistics,152


In [53]:
# Updating values in 'Marks' column
df['Marks'] = df['Marks'].apply(lambda x: x + 5)
df

Unnamed: 0,Name,Gender,Age,Marks,Subject,Double_Marks
0,Aarav Patel,Male,25,79,Python,148
1,Sophia Smith,Female,22,83,Python,156
2,Liam Johnson,Male,25,96,Generative AI,182
3,Saanvi Gupta,Female,23,78,Statistics,146
4,Olivia Brown,Female,24,86,Statistics,162
5,Indiana Jones,Male,21,93,Python,176
6,Rohan Sharma,Male,25,69,Statistics,128
7,James Wilson,Male,22,81,Statistics,152


In [55]:
# Renaming the 'Marks' column to 'Updated_Marks'
df = df.rename(columns={'Marks': 'Updated_Marks'})
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject,Double_Marks
0,Aarav Patel,Male,25,79,Python,148
1,Sophia Smith,Female,22,83,Python,156
2,Liam Johnson,Male,25,96,Generative AI,182
3,Saanvi Gupta,Female,23,78,Statistics,146
4,Olivia Brown,Female,24,86,Statistics,162
5,Indiana Jones,Male,21,93,Python,176
6,Rohan Sharma,Male,25,69,Statistics,128
7,James Wilson,Male,22,81,Statistics,152


In [57]:
df.drop(["Double_Marks"], axis=1, inplace=True)
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79,Python
1,Sophia Smith,Female,22,83,Python
2,Liam Johnson,Male,25,96,Generative AI
3,Saanvi Gupta,Female,23,78,Statistics
4,Olivia Brown,Female,24,86,Statistics
5,Indiana Jones,Male,21,93,Python
6,Rohan Sharma,Male,25,69,Statistics
7,James Wilson,Male,22,81,Statistics


---

### 5. **Indexing and Slicing DataFrames**

- Indexing in Pandas is essential for accessing specific parts of your data.
- By default, each row is labeled with an index number, but you can set a custom index with .set_index() to make accessing data more meaningful—like using customer IDs as the index.
- Slicing is useful for working with subsets of your data.
- For example, df[2:5] will return rows 2 through 4, allowing you to focus on a specific segment of your dataset.
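
A short sketch of these ideas, using a hypothetical `customers` table: positional slicing excludes the end position, while label-based slicing with .loc[] on a custom index includes both endpoints.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "balance": [100, 250, 75, 400, 320, 90],
})

# Positional slice: rows at positions 2, 3 and 4 (the end position is excluded).
by_position = customers[2:5]

# With a custom index, .loc looks rows up by label.
indexed = customers.set_index("customer_id")
one_balance = indexed.loc["C4", "balance"]

# Label slices with .loc include BOTH endpoints, unlike positional slices.
by_label = indexed.loc["C2":"C4"]

print(by_position)
print(one_balance)
print(by_label)
```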

In [63]:
# Accessing a specific value by row label and column name
df.loc[0, 'Name']

'Aarav Patel'

In [65]:
# Slicing the dataframe to get the first 2 rows
df[0:2]

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79,Python
1,Sophia Smith,Female,22,83,Python


In [67]:
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79,Python
1,Sophia Smith,Female,22,83,Python
2,Liam Johnson,Male,25,96,Generative AI
3,Saanvi Gupta,Female,23,78,Statistics
4,Olivia Brown,Female,24,86,Statistics
5,Indiana Jones,Male,21,93,Python
6,Rohan Sharma,Male,25,69,Statistics
7,James Wilson,Male,22,81,Statistics


In [69]:
df.index

RangeIndex(start=0, stop=8, step=1)

In [71]:
df.columns

Index(['Name', 'Gender', 'Age', 'Updated_Marks', 'Subject'], dtype='object')

In [73]:
# Setting 'Name' column as the index
df = df.set_index('Name')
df

Name,Gender,Age,Updated_Marks,Subject
Aarav Patel,Male,25,79,Python
Sophia Smith,Female,22,83,Python
Liam Johnson,Male,25,96,Generative AI
Saanvi Gupta,Female,23,78,Statistics
Olivia Brown,Female,24,86,Statistics
Indiana Jones,Male,21,93,Python
Rohan Sharma,Male,25,69,Statistics
James Wilson,Male,22,81,Statistics


In [75]:
df.index

Index(['Aarav Patel', 'Sophia Smith', 'Liam Johnson', 'Saanvi Gupta',
       'Olivia Brown', 'Indiana Jones', 'Rohan Sharma', 'James Wilson'],
      dtype='object', name='Name')

---

### 6. **Sorting and Ordering Data**

- Sorting helps bring order to your data.
- You can sort rows using .sort_values(by='column_name').
- This is particularly useful for seeing the highest or lowest values in a dataset, such as sorting sales data to see the top-performing products.
- If you need a more complex ordering, you can sort by multiple columns.
- You can also sort by the index with .sort_index(), which is helpful if you’ve set a meaningful custom index.
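
The following sketch uses an invented `sales` table to show a descending sort, a multi-column sort with a different direction per column, and nlargest() as a shortcut for "top n" questions.

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "region": ["North", "South", "North", "South"],
    "units": [30, 45, 28, 12],
})

# Top performers first.
top = sales.sort_values(by="units", ascending=False)

# Region A-Z, then units high-to-low within each region.
mixed = sales.sort_values(by=["region", "units"], ascending=[True, False])

# nlargest is a shortcut for "sort descending and keep the first n rows".
best_two = sales.nlargest(2, "units")

print(top)
print(mixed)
print(best_two)
```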

In [79]:
# Sorting by 'Updated_Marks', highest first
df = df.sort_values(by='Updated_Marks', ascending=False)
df

Name,Gender,Age,Updated_Marks,Subject
Liam Johnson,Male,25,96,Generative AI
Indiana Jones,Male,21,93,Python
Olivia Brown,Female,24,86,Statistics
Sophia Smith,Female,22,83,Python
James Wilson,Male,22,81,Statistics
Aarav Patel,Male,25,79,Python
Saanvi Gupta,Female,23,78,Statistics
Rohan Sharma,Male,25,69,Statistics


In [81]:
# Sorting by multiple columns: 'Age' and then 'Updated_Marks' (both ascending)
df = df.sort_values(by=['Age', 'Updated_Marks'])
df

Name,Gender,Age,Updated_Marks,Subject
Indiana Jones,Male,21,93,Python
James Wilson,Male,22,81,Statistics
Sophia Smith,Female,22,83,Python
Saanvi Gupta,Female,23,78,Statistics
Olivia Brown,Female,24,86,Statistics
Rohan Sharma,Male,25,69,Statistics
Aarav Patel,Male,25,79,Python
Liam Johnson,Male,25,96,Generative AI


In [83]:
# Sorting by index
df = df.sort_index()
df

Name,Gender,Age,Updated_Marks,Subject
Aarav Patel,Male,25,79,Python
Indiana Jones,Male,21,93,Python
James Wilson,Male,22,81,Statistics
Liam Johnson,Male,25,96,Generative AI
Olivia Brown,Female,24,86,Statistics
Rohan Sharma,Male,25,69,Statistics
Saanvi Gupta,Female,23,78,Statistics
Sophia Smith,Female,22,83,Python


---

### 7. **Handling Missing Data**

- Handling missing data properly is vital to maintaining the integrity of your analysis.
- You can identify missing values with .isnull(), which returns True for any missing value.
- Once you’ve identified these gaps, you have options—you can fill them using .fillna(), maybe with the mean value or a placeholder.
- Alternatively, if you don’t want incomplete records, use .dropna() to remove them.
- Each choice will impact the outcome of your analysis, so it’s crucial to consider which approach makes the most sense for your data.
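
To make the trade-offs above concrete, here is a minimal sketch on a made-up `records` table. It fills each column with its own strategy and contrasts dropping rows on one column with dropping rows that have any gap.

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "name": ["Asha", "Ben", "Chloe", "Dev"],
    "age": [25.0, np.nan, 31.0, np.nan],
    "city": ["Pune", None, "Leeds", "Delhi"],
})

# Count missing values per column.
missing_counts = records.isnull().sum()

# Fill each column with its own strategy: mean for numbers, a placeholder for text.
filled = records.fillna({"age": records["age"].mean(), "city": "Unknown"})

# Drop rows only when a specific column is missing.
has_city = records.dropna(subset=["city"])

# Drop rows with a missing value in any column.
complete = records.dropna()

print(missing_counts)
print(filled)
print(len(has_city), len(complete))
```

Filling keeps all four rows but invents two ages; dropping keeps only observed values but halves the table. Which is acceptable depends on the analysis.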

In [85]:
df.reset_index(inplace=True)

In [87]:
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79,Python
1,Indiana Jones,Male,21,93,Python
2,James Wilson,Male,22,81,Statistics
3,Liam Johnson,Male,25,96,Generative AI
4,Olivia Brown,Female,24,86,Statistics
5,Rohan Sharma,Male,25,69,Statistics
6,Saanvi Gupta,Female,23,78,Statistics
7,Sophia Smith,Female,22,83,Python


In [89]:
df.iloc[4,2] = None
df.iloc[-1,2] = None
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25.0,79,Python
1,Indiana Jones,Male,21.0,93,Python
2,James Wilson,Male,22.0,81,Statistics
3,Liam Johnson,Male,25.0,96,Generative AI
4,Olivia Brown,Female,,86,Statistics
5,Rohan Sharma,Male,25.0,69,Statistics
6,Saanvi Gupta,Female,23.0,78,Statistics
7,Sophia Smith,Female,,83,Python


In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           8 non-null      object 
 1   Gender         8 non-null      object 
 2   Age            6 non-null      float64
 3   Updated_Marks  8 non-null      int64  
 4   Subject        8 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 452.0+ bytes


In [93]:
# Checking for missing values
df.isnull()

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,True,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,True,False,False


In [95]:
# Counting missing values
df.isnull().sum()

Name             0
Gender           0
Age              2
Updated_Marks    0
Subject          0
dtype: int64

In [97]:
# Filling missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25.0,79,Python
1,Indiana Jones,Male,21.0,93,Python
2,James Wilson,Male,22.0,81,Statistics
3,Liam Johnson,Male,25.0,96,Generative AI
4,Olivia Brown,Female,23.5,86,Statistics
5,Rohan Sharma,Male,25.0,69,Statistics
6,Saanvi Gupta,Female,23.0,78,Statistics
7,Sophia Smith,Female,23.5,83,Python


In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           8 non-null      object 
 1   Gender         8 non-null      object 
 2   Age            8 non-null      float64
 3   Updated_Marks  8 non-null      int64  
 4   Subject        8 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 452.0+ bytes


In [101]:
# Convert 'Age' to the integer datatype; astype("int") truncates, so 23.5 becomes 23
df['Age'] = df['Age'].astype("int")
# 'Updated_Marks' is already int64, so this conversion is not needed:
# df['Updated_Marks'] = df['Updated_Marks'].astype("int")
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79,Python
1,Indiana Jones,Male,21,93,Python
2,James Wilson,Male,22,81,Statistics
3,Liam Johnson,Male,25,96,Generative AI
4,Olivia Brown,Female,23,86,Statistics
5,Rohan Sharma,Male,25,69,Statistics
6,Saanvi Gupta,Female,23,78,Statistics
7,Sophia Smith,Female,23,83,Python


In [103]:
# Preparing data for .dropna(): copy the dataframe and introduce missing values
df1 = df.copy()  # work on a copy so the original dataframe is unchanged
df1.iloc[4,3] = None  # set 'Updated_Marks' to missing in row 4
df1.iloc[-1,3] = None  # and in the last row
df1

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79.0,Python
1,Indiana Jones,Male,21,93.0,Python
2,James Wilson,Male,22,81.0,Statistics
3,Liam Johnson,Male,25,96.0,Generative AI
4,Olivia Brown,Female,23,,Statistics
5,Rohan Sharma,Male,25,69.0,Statistics
6,Saanvi Gupta,Female,23,78.0,Statistics
7,Sophia Smith,Female,23,,Python


In [105]:
df1 = df1.dropna()  # drop the rows that have missing values
df1

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79.0,Python
1,Indiana Jones,Male,21,93.0,Python
2,James Wilson,Male,22,81.0,Statistics
3,Liam Johnson,Male,25,96.0,Generative AI
5,Rohan Sharma,Male,25,69.0,Statistics
6,Saanvi Gupta,Female,23,78.0,Statistics


---

### 8. **Data Cleaning Techniques**

- Data cleaning is a major part of any data analysis project.
- You can use .drop_duplicates() to remove any duplicate entries that may skew your results.
- To ensure consistency, you might need to convert data types using .astype()—for example, making sure all date columns are of the datetime type.
- For text fields, using .str.lower() helps standardize the data, which is especially useful when dealing with inconsistent text inputs, such as customer names entered in different formats.
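
The sketch below shows why the order of cleaning steps can matter. In this invented `raw` table, two rows describe the same customer but differ in spacing and case, so they only count as duplicates after the text is standardized. The use of .str.strip() and pd.to_datetime() here is an illustrative choice, not something prescribed above.

```python
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Anna Lee", " anna lee", "BEN KAY"],
    "joined": ["2024-01-05", "2024-01-05", "2024-02-10"],
    "visits": ["3", "3", "7"],
})

cleaned = raw.copy()
# Standardize text: trim surrounding whitespace, then lowercase.
cleaned["customer"] = cleaned["customer"].str.strip().str.lower()
# Convert types: digit strings to integers, date strings to datetime.
cleaned["visits"] = cleaned["visits"].astype(int)
cleaned["joined"] = pd.to_datetime(cleaned["joined"])

# Rows 0 and 1 become identical only after the text is standardized.
deduped = cleaned.drop_duplicates()

print(len(raw.drop_duplicates()), len(deduped))
```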

In [107]:
# Creating a dataframe with duplicates
df2 = pd.DataFrame({
    'Name': ['Anna', 'Ben', 'Ben'],
    'Marks': [88, 75, 75],
    'Subject': ['Python', 'Statistics', 'Statistics']
})

df2

Unnamed: 0,Name,Marks,Subject
0,Anna,88,Python
1,Ben,75,Statistics
2,Ben,75,Statistics


In [109]:
# Removing duplicate rows
df2 = df2.drop_duplicates()
df2

Unnamed: 0,Name,Marks,Subject
0,Anna,88,Python
1,Ben,75,Statistics


In [111]:
# Converting 'Subject' to lowercase
df2['Subject'] = df2['Subject'].str.lower()
df2

Unnamed: 0,Name,Marks,Subject
0,Anna,88,python
1,Ben,75,statistics


---

### 9. **Merging and Joining DataFrames**

- Often, you’ll need to merge data from different sources.
- The .merge() function helps you combine two DataFrames based on a common key, such as customer IDs.
- You can choose different types of joins: an inner join keeps only matching records, while an outer join keeps all records from both DataFrames and fills the gaps with NaN.
- The pd.concat() function stacks DataFrames together, either adding rows (the default) or columns (with axis=1).
- For instance, you might merge customer profiles with sales data to get a complete view of customer behaviors.
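
As a sketch of the customer-profile scenario, the example below uses a left join, which keeps every profile whether or not it has orders. The `profiles` and `orders` tables are hypothetical; `indicator=True` is a standard merge option that records where each row came from.

```python
import pandas as pd

profiles = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chloe"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4], "amount": [250, 100, 75]})

# Left join: keep every profile and attach orders where they exist.
# indicator=True adds a '_merge' column describing the source of each row.
left = profiles.merge(orders, on="customer_id", how="left", indicator=True)

# Customers with no matching order are marked 'left_only'.
no_orders = left.loc[left["_merge"] == "left_only", "name"].tolist()

print(left)
print(no_orders)
```

Customer 1 appears twice in the result because it matches two orders, and the order for customer 4 is dropped because no profile exists for it.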

In [113]:
# Creating two dataframes
df1 = pd.DataFrame({'Name': ['A', 'B'], 'Marks': [90, 80]})
df1

Unnamed: 0,Name,Marks
0,A,90
1,B,80


In [115]:
df2 = pd.DataFrame({'Name': ['A', 'C'], 'Age': [18, 17]})
df2

Unnamed: 0,Name,Age
0,A,18
1,C,17


In [117]:
# Merging the dataframes on 'Name'
pd.merge(df1, df2, on='Name', how='inner')

Unnamed: 0,Name,Marks,Age
0,A,90,18


In [119]:
pd.merge(df1, df2, on='Name', how='outer')

Unnamed: 0,Name,Marks,Age
0,A,90.0,18.0
1,B,80.0,
2,C,,17.0


In [121]:
# Concatenating two dataframes
df3 = pd.DataFrame({'Name': ['D', 'E'], 'Marks': [88, 92]})
df3

Unnamed: 0,Name,Marks
0,D,88
1,E,92


In [123]:
pd.concat([df1, df3])

Unnamed: 0,Name,Marks
0,A,90
1,B,80
0,D,88
1,E,92
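
The concatenated result above keeps each DataFrame's original row labels, so 0 and 1 each appear twice. A minimal sketch of renumbering the rows with `ignore_index=True`, using stand-in frames with the same contents:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["A", "B"], "Marks": [90, 80]})
second = pd.DataFrame({"Name": ["D", "E"], "Marks": [88, 92]})

# Default: original row labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```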


---

### 10. **Grouping Data with GroupBy**

- The .groupby() function is one of the most powerful tools in Pandas.
- Imagine you have sales data from multiple regions, and you want to know the total sales per region.
- With .groupby(), you can split the data into groups and then aggregate each group using functions like .sum(), .mean(), or .count().
- This Split-Apply-Combine approach makes it easy to get high-level insights, such as regional sales performance or average spending per customer segment.
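
The regional sales scenario can be sketched as below. The `sales` table is made up for illustration; the second call uses named aggregation to compute several statistics per group at once.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [120, 80, 200, 50, 95],
})

# Split by region, apply sum to each group, combine into one Series.
total_by_region = sales.groupby("region")["amount"].sum()

# Named aggregation: each keyword becomes an output column.
summary = sales.groupby("region").agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    orders=("amount", "count"),
)

print(total_by_region)
print(summary)
```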

In [127]:
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,79,Python
1,Indiana Jones,Male,21,93,Python
2,James Wilson,Male,22,81,Statistics
3,Liam Johnson,Male,25,96,Generative AI
4,Olivia Brown,Female,23,86,Statistics
5,Rohan Sharma,Male,25,69,Statistics
6,Saanvi Gupta,Female,23,78,Statistics
7,Sophia Smith,Female,23,83,Python


In [129]:
# Grouping by 'Subject' and fetching the mean of 'Updated_Marks'
df.groupby('Subject')['Updated_Marks'].mean()

Subject
Generative AI    96.0
Python           85.0
Statistics       78.5
Name: Updated_Marks, dtype: float64

In [131]:
df.groupby('Subject')['Updated_Marks'].describe()

Subject,count,mean,std,min,25%,50%,75%,max
Generative AI,1.0,96.0,,96.0,96.0,96.0,96.0,96.0
Python,3.0,85.0,7.211103,79.0,81.0,83.0,88.0,93.0
Statistics,4.0,78.5,7.141428,69.0,75.75,79.5,82.25,86.0


---

### 11. **Pivot Tables in Pandas**

- Pivot tables are an incredibly powerful feature in Pandas.
- Just like in Excel, you can use .pivot_table() to summarize data in different ways.
- For example, you might have a DataFrame with sales transactions and you want to see total sales per region and product category.
- Pivot tables allow you to easily summarize data across multiple dimensions, giving you the flexibility to extract valuable insights from large datasets without writing a lot of code.
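
A minimal sketch of the region-by-category example, on an invented `transactions` table. The `fill_value=0` argument is optional; without it, combinations with no transactions would appear as NaN.

```python
import pandas as pd

transactions = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "category": ["Toys", "Books", "Toys", "Toys", "Toys"],
    "sales": [100, 40, 60, 30, 50],
})

# Rows = region, columns = category, cells = total sales.
# fill_value=0 fills the empty South/Books combination.
table = transactions.pivot_table(
    values="sales",
    index="region",
    columns="category",
    aggfunc="sum",
    fill_value=0,
)

print(table)
```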

In [135]:
# Creating a pivot table to find the mean and minimum marks by subject
df.pivot_table(values='Updated_Marks', index='Subject', aggfunc=['mean', 'min'])

Subject,mean (Updated_Marks),min (Updated_Marks)
Generative AI,96.0,96
Python,85.0,79
Statistics,78.5,69


---

#### **Pivot Tables vs GroupBy**

In pandas, groupby and pivot_table are both used to summarize data, but they serve slightly different purposes:

- **groupby** works well for summaries along a single dimension, such as a total, average, or count per group.
- **pivot_table** is more flexible and suits multi-dimensional summaries, where one grouping runs down the rows and another across the columns. Combinations with no data show up as NaN (or as the value passed to fill_value), whereas a groupby on several keys simply omits them.
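
The two tools are closely related. As a rough sketch on a made-up `scores` table, a groupby on two keys followed by .unstack() produces the same two-way layout that pivot_table builds directly:

```python
import pandas as pd

scores = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "subject": ["Python", "Stats", "Python", "Python", "Stats"],
    "marks": [80, 90, 70, 60, 75],
})

# pivot_table builds the two-way table directly.
via_pivot = scores.pivot_table(
    values="marks", index="gender", columns="subject", aggfunc="mean"
)

# groupby on two keys gives a long Series; .unstack() moves 'subject' into the columns.
via_groupby = scores.groupby(["gender", "subject"])["marks"].mean().unstack()

print(via_pivot)
print(via_groupby)
```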

**Using groupby**

- Here, groupby is used to compute the mean of Updated_Marks for each gender.

In [137]:
grouped = df.groupby('Gender')['Updated_Marks'].mean().reset_index()
grouped

Unnamed: 0,Gender,Updated_Marks
0,Female,82.333333
1,Male,83.6


**Using pivot_table**
- Now, let’s say we want to see the average marks of each gender across different subjects. pivot_table is a better choice here.

In [139]:
pivot = df.pivot_table(values='Updated_Marks', index='Gender', columns='Subject', aggfunc='mean')
pivot

Gender,Generative AI,Python,Statistics
Female,,83.0,82.0
Male,96.0,86.0,75.0


---

### 12. **Applying Functions Using .apply() and .map()**

- Pandas makes it easy to apply functions across data.
- With .apply(), you can apply a custom function to an entire column or row.
- For example, if you want to apply a discount to each price, .apply() makes this very straightforward.
- .map() is another useful method, typically used for transforming values in a Series, such as mapping a categorical value to a numerical one.
- Imagine categorizing age ranges into groups like young, middle-aged, and senior—these functions make it simple.
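
The age-group and discount ideas above can be sketched as follows. The `people` table, the cut-offs of 35 and 60, and the discount rates are arbitrary choices made for this example.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Asha", "Ben", "Chloe"],
    "age": [22, 47, 68],
    "price": [100.0, 250.0, 80.0],
    "tier": ["gold", "basic", "gold"],
})


def age_group(age):
    """Bucket an age; the cut-offs 35 and 60 are arbitrary for this example."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


# .apply() on a Series: the function receives one value at a time.
people["age_group"] = people["age"].apply(age_group)

# .map() with a dict: each value is looked up; values not in the dict become NaN.
people["discount"] = people["tier"].map({"gold": 0.20, "basic": 0.05})

# .apply() with axis=1: the function receives one row at a time.
people["final_price"] = people.apply(
    lambda row: row["price"] * (1 - row["discount"]), axis=1
)

print(people)
```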

In [143]:
# Subtracting 5 from each value in the 'Updated_Marks' column (undoing the earlier +5)
df['Updated_Marks'] = df['Updated_Marks'].apply(lambda x: x - 5)
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject
0,Aarav Patel,Male,25,74,Python
1,Indiana Jones,Male,21,88,Python
2,James Wilson,Male,22,76,Statistics
3,Liam Johnson,Male,25,91,Generative AI
4,Olivia Brown,Female,23,81,Statistics
5,Rohan Sharma,Male,25,64,Statistics
6,Saanvi Gupta,Female,23,73,Statistics
7,Sophia Smith,Female,23,78,Python


In [145]:
# Define a function to assign grades based on marks
def assign_grade(marks):
    """Assign grades based on the marks:
    - A: 85 and above
    - B: 70 to 84
    - C: below 70
    """
    if marks >= 85:
        return 'A'
    elif 70 <= marks < 85:
        return 'B'
    else:
        return 'C'

In [147]:
# Apply the function to the 'Updated_Marks' column to create a new 'Grade' column
df['Grade'] = df['Updated_Marks'].apply(assign_grade)
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject,Grade
0,Aarav Patel,Male,25,74,Python,B
1,Indiana Jones,Male,21,88,Python,A
2,James Wilson,Male,22,76,Statistics,B
3,Liam Johnson,Male,25,91,Generative AI,A
4,Olivia Brown,Female,23,81,Statistics,B
5,Rohan Sharma,Male,25,64,Statistics,C
6,Saanvi Gupta,Female,23,73,Statistics,B
7,Sophia Smith,Female,23,78,Python,B


In [149]:
# Mapping 'Gender' values to numbers in a new 'Gender_new' column
df['Gender_new'] = df['Gender'].map({"Male": 0, "Female": 1})
df

Unnamed: 0,Name,Gender,Age,Updated_Marks,Subject,Grade,Gender_new
0,Aarav Patel,Male,25,74,Python,B,0
1,Indiana Jones,Male,21,88,Python,A,0
2,James Wilson,Male,22,76,Statistics,B,0
3,Liam Johnson,Male,25,91,Generative AI,A,0
4,Olivia Brown,Female,23,81,Statistics,B,1
5,Rohan Sharma,Male,25,64,Statistics,C,0
6,Saanvi Gupta,Female,23,73,Statistics,B,1
7,Sophia Smith,Female,23,78,Python,B,1


---

### 13. **Handling Dates and Times in Pandas**

- Working with dates and times is common in data analysis.
- With Pandas, you can easily convert strings into datetime objects using pd.to_datetime().
- Once converted, you can extract useful components like the year, month, or day, which are critical for time-series analysis.
- For example, you could analyze monthly sales trends to determine which months typically see the highest revenue.
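
A minimal sketch of the monthly-trend idea, using an invented `sales` table. After conversion with pd.to_datetime(), the .dt accessor exposes the date components, which can then drive a groupby.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": ["2024-01-15", "2024-01-28", "2024-02-03", "2024-03-09"],
    "revenue": [120, 80, 250, 150],
})

# Strings -> datetime values.
sales["date"] = pd.to_datetime(sales["date"])

# The .dt accessor exposes date components.
sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month
sales["weekday"] = sales["date"].dt.day_name()

# Total revenue per calendar month, and the month with the highest total.
monthly = sales.groupby("month")["revenue"].sum()
best_month = monthly.idxmax()

print(monthly)
print(best_month)
```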

In [171]:
# Creating a dataframe with date strings
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Date': ['2024-01-01', '2024-02-01', '2024-03-01']
})
df

Unnamed: 0,Name,Date
0,A,2024-01-01
1,B,2024-02-01
2,C,2024-03-01


In [165]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Date    3 non-null      object
dtypes: object(2)
memory usage: 180.0+ bytes


In [173]:
# Converting the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

In [175]:
df

Unnamed: 0,Name,Date
0,A,2024-01-01
1,B,2024-02-01
2,C,2024-03-01


In [177]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Name    3 non-null      object        
 1   Date    3 non-null      datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 180.0+ bytes


---