### Introduction to Pandas

Pandas is a Python library used for data manipulation and analysis. Pandas provides a convenient way to analyze and clean data.

The Pandas library introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy.

### What is Pandas Used for?

Pandas is a powerful library generally used for:

- Data Cleaning
- Data Transformation
- Data Analysis
- Machine Learning
- Data Visualization

### Why Use Pandas?

Some of the reasons why we should use Pandas are as follows:

1. **Handle Large Data Efficiently**

   Pandas is designed for handling large datasets. It provides powerful tools that simplify tasks like data filtering, transforming, and merging.

   It also provides built-in functions to work with formats like CSV, JSON, TXT, Excel, and SQL databases.

2. **Tabular Data Representation**

   Pandas DataFrames, the primary data structure of Pandas, handle data in tabular format. This allows easy indexing, selecting, replacing, and slicing of data.

3. **Data Cleaning and Preprocessing**

   Data cleaning and preprocessing are essential steps in the data analysis pipeline, and Pandas provides powerful tools to facilitate these tasks. It has methods for handling missing values, removing duplicates, handling outliers, data normalization, etc.

4. **Time Series Functionality**

   Pandas contains an extensive set of tools for working with dates, times, and time-indexed data as it was initially developed for financial modeling.

5. **Free and Open-Source**

   Pandas is released under the permissive BSD 3-Clause license, so you can use and distribute it for free, even for commercial use.


## 1. Introduction to Pandas

* What is Pandas?
* Why use it in data science?

## 2. Installing and Importing Pandas

* `pip install pandas`
* `import pandas as pd`

## 3. Pandas Data Structures

* Series
* DataFrame

## 4. Reading and Writing Files

* CSV, Excel, JSON
* `pd.read_csv`, `to_csv`, etc.

## 5. Exploring Data

* `head()`, `tail()`, `info()`, `describe()`

## 6. Indexing and Selecting Data

* `loc`, `iloc`, `at`, `iat`
* Conditional filtering

## 7. Data Cleaning

* Handling missing values
* Changing data types
* Renaming columns

## 8. Applying Functions

* `apply()`, `map()`, `lambda` with Pandas

## 9. GroupBy and Aggregations

* `groupby()`, `agg()`, `pivot_table()`

## 10. Merging and Joining

* `merge()`, `concat()`, `join()`

## 11. Handling Missing Data

* `isnull()`, `dropna()`, `fillna()`

## 12. Real-World Business Example

* Sales dataset or Superstore analysis

## 13. Assignments

### Install Pandas

To install pandas, you need Python and pip installed on your system. If both are already installed, you can install pandas by running the following command in the terminal:

```bash
pip install pandas
```

If the installation completes without any errors, Pandas is now successfully installed on your system. You can start using it in your Python projects by importing the Pandas library.

### Import Pandas in Python

We can import Pandas in Python using the import statement:

```python
import pandas as pd
```


In [None]:
# pip install pandas
# pip install pandas==2.2.2   # pin an exact version with "==" (a single "=" is not valid pip syntax)

In [None]:
import pandas
pandas.__version__

In [None]:
#pandas>>data manipulation and data wrangling

# create Series, DataFrame ,indexing, columns, data types, assign, create new columns

# 1.data Load --> different types
# 2.data basic Informations
# 3. null values check
# 3.1 null values remove
# 4. data type check and correct it
# 5. duplicates value check and remove
# 6. remove columns/ or
# 7. add columns after calculations
# 8. save the clean data
# 9. solve the business questions / data analysis
# 10. visualize the data

# data cleaning
# drop columns,  drop rows, fill missing values, handle outliers, remove duplicates


# data transformation
# pivot, melt, groupby, sort_values, sort_index, rank, quantile, shift
# data merging
# join, merge, concat (DataFrame.append was removed in pandas 2.0; use concat instead)
# data analysis
# summary statistics, descriptive statistics, correlation, regression, time series analysis
# data visualization
# plot, scatter plot, bar plot, histogram, box plot, violin plot, heatmap
# data export
# to_csv, to_excel, to_json, to_pickle, to_sql



# A Pandas Series is a one-dimensional labeled array-like object that can hold data of any type.
### Labels

# The labels in the Pandas Series are
# index numbers by default. Like in DataFrame and array, the index number in Series starts from 0.


### Series
The pandas Series is a fundamental data structure in the Python pandas library. It is a **one-dimensional** labeled array capable of holding data of any type (integer, string, float, etc.), similar to a single column in a spreadsheet or database table. 

In [None]:
import numpy as np
a = np.array([1,2,3,4,5])
print(a)
a.ndim

In [None]:
type(a)

In [None]:
# series is a 1D array
import pandas as pd
s = pd.Series([1,2,3,4,5])
print(s)
s.ndim

In [None]:
type(s)

### Series and dataframe (Indexing & Slicing)
##### Series

In [None]:
# create the Series using a list
import pandas as pd
li = [34,3,534,34,23]
se = pd.Series(li)
# indexing
se[2]

In [None]:
# assign new value
se[2] = 45
se

In [None]:
se[2:4]
# slicing

In [None]:
se[2:4] =4,5
se

In [None]:
# Create a Series and specify labels
se2 = pd.Series([12,32,243,45] , index=['a','b','c','d'],)
se2


In [None]:
se2['b']

In [None]:
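se2.iloc[2]
# positional access still works after adding custom labels, but use .iloc for it;
# plain se2[2] on a label-indexed Series is deprecated in recent pandas versions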

In [None]:
se2["d"]

In [None]:
# METHODS
se2.pop('b')
# delete the value

In [None]:
se2

In [None]:
se

In [None]:
# change the index name using index
se.index = ['a','b','c','d','e']
se

In [None]:
l2 = ['a','x','c','d','e']
se.index = l2
se

In [None]:
# change the index labels using the set_axis method (returns a new Series; se itself is unchanged)
se.set_axis(['x','y','z','a','b'])

In [None]:
se

In [None]:
se = se.set_axis(['x','y','z','a','b']) # reassign for permanent change
se

In [None]:
se.reset_index() # reset index with default indexing

In [None]:
type(se.reset_index())

## DataFrame /2D/ tabular data
A DataFrame is a 2-dimensional, labeled data structure with columns of potentially different types, organized like a spreadsheet or an SQL table. It is a fundamental tool in data analysis for storing, manipulating, and analyzing data, often using the pandas library in Python.
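To make "columns of potentially different types" concrete, here is a minimal sketch. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each column carries its own dtype
people = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],   # strings
    "age": [31, 45, 27],               # integers
    "height_m": [1.62, 1.80, 1.75],    # floats
    "member": [True, False, True],     # booleans
})

print(people.dtypes)   # one dtype per column
print(people.shape)    # (number of rows, number of columns)
```

Each column behaves like a Series, which is why a single DataFrame can mix text, numbers, and booleans.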

In [None]:
# create a dataframe
df = pd.DataFrame()
print(df)

In [None]:
[[1,2,3],[4,5,6],[7,8,9]]

In [None]:
# use np
np.array([[1,2,3],[4,5,6],[7,8,9]])

In [None]:
# create a DataFrame from a nested list (a 2D array works the same way)
df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df1

In [None]:
# create a DataFrame from a dictionary
df2 = pd.DataFrame({"x":[1,2,3],"y":[4,5,6],"z":[7,8,9]})
df2

In [None]:
df2.set_axis(["A",'B','C'],axis=1)
# axis =0 means change the index/row name
# axis =1 means change the columns name

In [None]:
df2 = df2.set_axis(["A",'B','C'])  # default axis=0: relabels the rows as A, B, C
df2

### Indexing and slicing in DataFrame



In [None]:
# indexing columns
df2["x"]

In [None]:
# more than one column
df2[['x','y']]

###  .loc and .iloc 
In Pandas, `.loc` and `.iloc` are the two main indexers for selecting data from a DataFrame. They differ in how rows and columns are addressed.

* `.loc` (label-based indexing): accesses data by the labels (names) of rows and columns.
  * Syntax: `df.loc[row_label, column_label]`
  * Slicing: both the start and end labels are inclusive. For example, `df.loc['start_row':'end_row']` includes both `'start_row'` and `'end_row'`.
  * Usage: ideal when you know the names of the rows and columns you want, or when working with non-integer or custom indices. It also accepts Boolean arrays for conditional selection.

* `.iloc` (integer-position-based indexing): accesses data by the integer positions of rows and columns. Positions are 0-indexed: the first row or column is at position 0, the second at position 1, and so on.
  * Syntax: `df.iloc[row_position, column_position]`
  * Slicing: the start position is inclusive and the end position is exclusive. For example, `df.iloc[0:5]` includes the rows at positions 0, 1, 2, 3, and 4, but not 5.
  * Usage: useful when you need to select data by its position in the DataFrame, regardless of the labels. It can be somewhat faster on very large datasets because it works directly with integer positions.

Key differences summarized:

| Feature     | `.loc`                                                  | `.iloc`                                                        |
| ----------- | ------------------------------------------------------- | -------------------------------------------------------------- |
| Indexing    | Label-based (row/column names)                          | Integer-position based (0-indexed)                             |
| Slicing     | Inclusive of both start and end labels                  | Exclusive of the end position                                  |
| Flexibility | Handles various index types (e.g., strings, dates)      | Integer positions only                                         |
| Speed       | Can be slower on large datasets (labels are looked up)  | Can be faster on large datasets (positions accessed directly)  |
Example:

```python
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data, index=['rowA', 'rowB', 'rowC'])

# Using .loc
print(df.loc['rowB', 'col1'])  # Accesses value at row 'rowB', column 'col1'
print(df.loc['rowA':'rowC', 'col1']) # Slices rows from 'rowA' to 'rowC' (inclusive) for 'col1'

# Using .iloc
print(df.iloc[1, 0])  # Accesses value at row position 1, column position 0
print(df.iloc[0:2, 0]) # Slices rows at positions 0 and 1 (position 2 is excluded) for column position 0
```
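The inclusive/exclusive difference is easiest to see when the row labels are integers that do not match the positions. The sketch below uses a hypothetical `scores` frame and also shows a Boolean mask passed to `.loc`.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions
scores = pd.DataFrame({"score": [50, 60, 70, 80]}, index=[10, 20, 30, 40])

by_label = scores.loc[10:30]      # labels 10, 20, 30 (end label included)
by_position = scores.iloc[0:2]    # positions 0 and 1 (end position excluded)
high = scores.loc[scores["score"] > 60, "score"]  # Boolean mask with .loc

print(len(by_label), len(by_position), list(high))
```

With a default 0..n-1 index the two indexers look similar, which is why mixing them up often goes unnoticed until the index changes.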

In [None]:
# loc and iloc
df2
# .loc (Label-based indexing):
# Syntax: df.loc[row_label, column_label]
# Slicing:df.loc['start_row':'end_row']

# .iloc (Integer-position based indexing):
# Syntax: df.iloc[row_position, column_position]

In [None]:
# return x column --> loc
df2.loc[:,'x']

In [None]:
# return x and y column --> loc
df2.loc[:,'x':'y']

In [None]:
# return x and y column
# iloc
df2.iloc[:,0:2]

In [None]:
# get the first 2 rows and the first 2 columns (the rows were relabeled A, B, C with set_axis above)
df2.loc[:"B", 'x':"y"]


In [None]:
df2.loc['A','x':'z'] # only 1 row data

In [None]:
# get only the value 4 from the table/df
df2.loc['A',"y"]

In [None]:
df2

In [None]:
# show 4,7
df2.loc['A',"y":]
# slicing

In [None]:
# Attributes
df2.ndim
df2.shape
df2.size
df2.dtypes

In [None]:
## Methods 


In [None]:
s
# 2.data basic Informations
# 3. null values check
# 3.1 null values remove
# 4. data type check and correct it
# 5. duplicates value check and remove
# 6. remove columns/ or
# 7. add columns after calculations
# 8. save the clean data
# 9. solve the business questions / data analysis
# 10. visualize the data

#### Load the Data from Files

csv, xls, website, database, cloud

In [None]:
#### 1.data Load --> different type
pd.read_csv('Netflix.csv')

In [None]:
import pandas as pd
# read the csv file
f1 = pd.read_csv(r'C:\Users\hp\Documents\ds_materials\5.0_python_6.libraries\python_libraries\2.pandas\extra\historical_automobile_sales.csv')
f1.head() # first 5 rows

In [None]:
f1.head(10) # starting  10 rows 

In [None]:
!pip install openpyxl

In [None]:
!pip install xlrd

In [None]:
# read excel file
f2 = pd.read_excel(r"c:\Users\hp\Documents\ds_materials\5.0_python_6.libraries\python_libraries\Sample - Superstore.xls")
f2.head(20) # pass a number to show that many rows (here 20)

In [None]:
import pandas as pd
# load a CSV file from a URL
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df

In [None]:
# read html table data
!pip install lxml
import lxml
url_df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2015_totals.html")
url_df
type(url_df)

In [None]:
len(url_df) # 2 table in url

In [None]:
# show
url_df[0]

In [None]:
# save
url_df[0].to_csv("players.csv")

In [None]:
# read json file
df = pd.read_json("https://api.github.com/repos/pandas-dev/pandas/issues")
df

In [None]:
# database connect use
# 1.mysql-connector
# 2.sqlalchemy

In [None]:
# mysql connector
!pip install mysql-connector-python
import mysql.connector

In [None]:
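# Create the connection object (host, user, password, and database are local example values)
myconn = mysql.connector.connect(host = "127.0.0.1", user = "root",
                                 passwd = "root" ,database = "mavenmovies")

# Create the cursor object
cur = myconn.cursor()
try:
    # Read the Actor table
    cur.execute("select * from Actor")

    # Fetch all rows from the cursor object
    result = cur.fetchall()

    # Print each row
    for x in result:
        print(x)
except mysql.connector.Error as err:
    # Report the database error instead of silently ignoring it
    print("Query failed:", err)
finally:
    # Always release the connection, even if the query failed
    myconn.close()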
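Query results can also be loaded straight into a DataFrame instead of being printed row by row. The sketch below is an assumption-laden illustration: it swaps MySQL for an in-memory SQLite database from the standard library so that it runs without a server, and the `actor` table and its rows are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used purely for illustration (not the MySQL server above)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO actor VALUES (?, ?)", [(1, "PENELOPE"), (2, "NICK")])
conn.commit()

# read_sql_query runs the query and returns the result as a DataFrame
actors = pd.read_sql_query("SELECT * FROM actor", conn)
conn.close()

print(actors)
```

For MySQL, pandas documents SQLAlchemy connectables as the supported route, so passing a SQLAlchemy engine in place of `conn` is the usual approach.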

### Data Explore

## Use Titanic dataset

https://www.kaggle.com/competitions/titanic/data



# 1.data Load --> different types

In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.head() # show the starting 5 rows
df.head(10) # show only 10 rows

# bottom data
df.tail() # show the last 5 rows
df.tail(10) # show only 10 rows

# 2.data basic Informations


In [None]:
# shape, size ,ndim
print(df.shape)
print(df.size)
print(df.ndim)

In [None]:
# show only the column names of the table
df.columns

In [None]:
df.dtypes # show the data type of each column

In [None]:
# basic info about data
df.info()

In [None]:
df.head()

In [None]:
# statistical summary about data
df.describe()


In [None]:
df.describe(include = 'object')
df.describe(include = 'all')

In [None]:
df.sample()
# return a random value from df
df.sample(5)

In [None]:
df

In [None]:
# indexing and slicing
df["Name"] # only show the name column

In [None]:
# show the name and fare columns
df[["Name","Fare"]]

In [None]:
# show the value from name  to fare
df.loc[:,"Name":"Fare"]
# exercise: show the first 10 passengers' details from Name to Fare

In [None]:
# show the data from row 10 to 20

# show the data from row 30 to 35 and show only name and fare


# Access df rows: implicit index (internal integer positions) vs explicit index (defined labels)
df[0:100]
df.iloc[0:2] # positions 0 and 1 (end position excluded)
df.loc[0:2] # rows whose labels are 0, 1, 2 (end label included)

# loc works with labels, iloc works with integer positions
# df.iloc[0:2, ['Name', 'Sex', 'Age']] # raises an error. Why? iloc accepts positions, not column labels
df.loc[0:2, ['Name', 'Sex', 'Age']]

df
df.iloc[0:2, 3:6]
list(df['Name'][2:5])
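The sketch below illustrates why `.iloc` rejects column names and one way to convert labels to positions first. The `mini` frame is a hypothetical stand-in for the passenger table, and the exact exception type may vary between pandas versions, so several are caught.

```python
import pandas as pd

# Hypothetical miniature version of the passenger table
mini = pd.DataFrame({"Name": ["A", "B", "C"], "Sex": ["f", "m", "f"], "Age": [22, 35, 4]})

try:
    mini.iloc[0:2, ["Name", "Age"]]   # labels are not integer positions
    iloc_failed = False
except (IndexError, TypeError, ValueError, KeyError) as exc:
    iloc_failed = True
    print("iloc rejected labels:", type(exc).__name__)

# Translate labels into positions, then use iloc
positions = mini.columns.get_indexer(["Name", "Age"])
subset = mini.iloc[0:2, positions]
print(subset)
```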


# 3. null values check
# 3.1 null values remove

In [None]:
df["Age"].isnull().sum()

In [None]:
#  isna , isnull ,fillna
df.isna().sum() # check null values

In [None]:
df.isnull().sum()

In [None]:
# if more than 50% of a column's values are null, consider dropping the column
df.drop('Cabin',axis=1)

In [None]:
# for a permanent change
# 1. reassign
# df = df.drop('Cabin',axis=1)

# 2. use inplace parameter
df.drop('Cabin',axis=1,inplace=True)
df
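The 50% rule of thumb can be checked programmatically rather than by eye. This is a minimal sketch on a hypothetical `sample` frame; the threshold of 0.5 mirrors the comment above and is a judgment call, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "cabin" is mostly missing, "age" only partly
sample = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, "C85", np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

null_share = sample.isnull().mean()            # fraction of nulls per column
to_drop = null_share[null_share > 0.5].index   # columns above the 50% threshold
cleaned = sample.drop(columns=to_drop)

print(null_share)
print(cleaned.columns.tolist())
```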

#### Replace the null value

In [None]:
df.isnull().sum()

In [None]:
df["Age"].mean()  # find the mean
df["Age"].median()

In [None]:
# pip install matplotlib
# pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# create a boxplot with seaborn (matplotlib displays it)
sns.boxplot(x=df["Age"])
plt.show()
# check the outlier

In [None]:
# warning remove
import warnings
warnings.filterwarnings("ignore")

In [None]:
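# fillna
df["Age"].fillna(29) # fill with a constant; returns a new Series

df["Age"].fillna(df["Age"].mean()) # only a temporary change
# reassign for a permanent change; inplace=True on a selected column is unreliable in recent pandas versions
df["Age"] = df["Age"].fillna(df["Age"].mean())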

In [None]:
# categorical data replace
# mode
df["Embarked"].mode()[0]

df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

In [None]:
df.isnull().sum()

In [None]:
# remove null values only in row wise
df.dropna(axis=0,inplace=True)

In [None]:
df.isnull().sum()

5. duplicates value check and remove

In [None]:
# check the duplicate value
df.duplicated().sum()

# remove the duplicate value
df.drop_duplicates(inplace=True)

4. data type check and correct it

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
# change the data type --> using astype
df["PassengerId"] =  df["PassengerId"].astype("object")

In [None]:
# 7. add columns
df['age'] =0
df

In [None]:
df["Fare"]

In [None]:
# using
df["Fare"][0]

In [None]:
d =df["Fare"][3]
print(d)
if d>100:
  print("premium")
elif d>50:
  print("standard")
else:
  print("economy")

In [None]:
len(df)

In [None]:
df.head(65)

In [None]:
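# build the new column row by row with a loop (slow; shown for comparison with apply below)
# df.reset_index(drop=True, inplace=True)  # optional: restore a 0..n-1 index after dropping rows
df["type"] = ""
for i in df.index:
  fare = df.loc[i, "Fare"]
  print(fare)
  if fare > 100:
    df.loc[i, "type"] = "premium"   # .loc avoids chained assignment such as df["type"][i] = ...
    print("premium")
  elif fare > 50:
    df.loc[i, "type"] = "standard"
    print("standard")
  else:
    df.loc[i, "type"] = "economy"
    print("economy")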

In [None]:
# apply --> write the conditions as a function
def fare_type(x):
  if x>100:
    return "premium"
  elif x>50:
    return "standard"
  else:
    return "economy"

In [None]:
df["customer_type"] = df["Fare"].apply(fare_type)
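The same fare categories can also be produced without calling a Python function per row. A minimal sketch using `pd.cut`, assuming the thresholds from `fare_type` (50 and 100) and a handful of hypothetical fare values:

```python
import numpy as np
import pandas as pd

fares = pd.Series([7.25, 50.0, 71.28, 100.0, 512.33], name="Fare")  # hypothetical values

# Right-closed bins: (-inf, 50], (50, 100], (100, inf)
customer_type = pd.cut(
    fares,
    bins=[-np.inf, 50, 100, np.inf],
    labels=["economy", "standard", "premium"],
)

print(customer_type.tolist())
```

Because the bins are closed on the right, a fare of exactly 100 lands in "standard" and exactly 50 in "economy", matching the `>` comparisons in `fare_type`.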

In [None]:
df

In [None]:
# 6. remove columns
df = df.drop("type",axis=1)

In [None]:
# save the clean data
df.to_csv("cleantitanic_data.csv",index=False)

In [None]:
# load
df = pd.read_csv("cleantitanic_data.csv")
df.head()

# 9. solve the business questions / data analysis
 0. Boolean indexing
 1. group by
 2. aggregate
 3. sorting
 4. reset index  
 5. Rename columns
 6. handle error values

## Boolean Indexing


In [None]:
# age
# conditions
df['Age'] >60
# and it return true or false

In [None]:
# show the details of senior passengers (age over 60)
# df[pass boolean conditions as index]
# df[index]
df[df['Age'] >60]

In [None]:
# show the only Age columns
df["Age"][df['Age'] >60]

In [None]:
# show the name and age of passengers who are seniors
df.loc[df['Age'] >60, ["Name","Age"]]

In [None]:
# Q.find the max fare
df["Fare"].max()
# Q.show the details of the passenger who paid the maximum fare

df["Fare"] == df["Fare"].max() # condition

# indexing
df[ df["Fare"] == df["Fare"].max() ]

In [None]:
# count the total number of teenager <18
df["Age"]<18 # conditions

len(df[df["Age"]<18])

In [None]:
# count the total number of females under 18
# multiple conditions
# Python's and / or / not do not work element-wise on a Series; use & (and), | (or), ~ (not)
# and wrap each condition in parentheses

# 1st conditions
df["Age"]<18
# 2nd conditions
df["Sex"] == "female"

(df["Age"]<18) & (df["Sex"] == "female")

df[(df["Age"]<18) & (df["Sex"] == "female")]

len(df[(df["Age"]<18) & (df["Sex"] == "female")])
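The remaining two operators work the same way as `&`. A small sketch on a hypothetical four-passenger frame `p`, showing OR, NOT, and counting matches with `sum()` instead of `len()`:

```python
import pandas as pd

# Hypothetical passengers
p = pd.DataFrame({
    "Age": [10, 17, 30, 65],
    "Sex": ["female", "male", "female", "male"],
})

minors_or_female = p[(p["Age"] < 18) | (p["Sex"] == "female")]          # OR
not_minor = p[~(p["Age"] < 18)]                                         # NOT
female_minor_count = ((p["Age"] < 18) & (p["Sex"] == "female")).sum()   # AND, True counts as 1

print(len(minors_or_female), len(not_minor), female_minor_count)
```

The parentheses matter: `&`, `|`, and `~` bind more tightly than comparisons such as `<` and `==`.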

In [None]:
# find the number of passengers who paid more than the average fare
df["Fare"].mean()
df["Fare"] > df["Fare"].mean()
len(df[df["Fare"] > df["Fare"].mean()])

# find the passengers who paid the maximum fare (add & (df["Survived"]==1) to keep only survivors)
len(df[df["Fare"]== df["Fare"].max()])
df[df["Fare"]== df["Fare"].max()]


In [None]:
# find the passengers who paid the minimum fare and survived
len(df[df["Fare"]== df["Fare"].min()])
df[df["Fare"]== df["Fare"].min()]

df["Fare"]== df["Fare"].min()
df["Survived"]==1
df[(df["Fare"]== df["Fare"].min()) & (df["Survived"]==1)]

In [None]:
# methods
# unique()
df["Embarked"].unique()  # show the distinct values from df
# nunique()
df["Embarked"].nunique() # show the distinct values count from df
# value_counts()
df["Embarked"].value_counts() # show the distinct values and count from df
df["Survived"].value_counts()

In [None]:

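# Rename columns: the mapping is {old_name: new_name}
# a key that is not an existing column (such as 'A' here) is silently ignored
df.rename(columns={'A': 'Age'}, inplace=True)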

#### 1.group by

 The groupby() method splits a DataFrame or Series into groups based on the unique values in one or more specified columns

 df.groupby('column name')
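The split step can be inspected before any aggregation is applied. A minimal sketch, assuming a hypothetical `sales` frame with a `region` column to group on:

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [100, 150, 200, 50, 300],
})

groups = sales.groupby("region")     # split: one group per unique region
print(groups.ngroups)                # number of groups
print(groups.get_group("north"))     # rows that fell into the "north" group

totals = groups["amount"].sum()      # apply and combine: one total per region
print(totals)
```

Nothing is computed until an aggregation such as `sum()` or `mean()` is called on the grouped object.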


In [None]:
df.groupby("Survived")["Fare"].mean()

In [None]:
# total fare paid by male and female
# Embarked-wise sales

# multiple columns for groupings
# count the number  survive or not survive in each gender
df.groupby(["Sex","Survived"]).count()

In [None]:
df.groupby(["Sex","Survived"])["PassengerId"].count()

In [None]:
# Aggregation: Calculating a summary statistic for each group (e.g., sum(), mean(), count(), min(), max()).
df.groupby(["Survived"])["Fare"].aggregate(["mean",'sum','count','min','max'])

In [None]:
df.groupby(["Survived","Embarked"])["Fare"].count()

### reset_index

In [None]:
xyz = df.groupby(["Survived","Embarked"])["Fare"].count().reset_index()

# change the columns name
# df.rename(columns={'A': 'Age'}, inplace=True)
xyz.rename(columns={'Fare': 'count'}, inplace=True)
xyz

## Sortings


In [None]:
xyz.sort_values(by = "count",ascending = False)

In [None]:
# sort by index
xyz.sort_index()
xyz.sort_index(ascending = False)

### handle Error value


In [None]:
# 2. Handle Error Values
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 'error', 92]
})
df1

In [None]:
pd.to_numeric(df1['Score'], errors='coerce')

In [None]:
# Replace 'error' with NaN and convert column to numeric
df1['Score'] = pd.to_numeric(df1['Score'], errors='coerce')
df1
# When errors='coerce' is specified in these functions, it dictates how the function handles values that cannot be successfully converted to the target data type. Instead of raising an error and stopping the execution (which is the default behavior, errors='raise'), errors='coerce' will convert any unparseable or invalid values to NaN (Not a Number) for numeric types or NaT (Not a Time) for datetime/timedelta types.
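The datetime case mentioned above behaves the same way. A short sketch with a hypothetical list of date strings, one of which cannot be parsed:

```python
import pandas as pd

raw_dates = pd.Series(["2024-02-08", "not a date", "2024-02-10"])  # hypothetical input

parsed = pd.to_datetime(raw_dates, errors="coerce")  # the unparseable entry becomes NaT
print(parsed)
print(parsed.isna().sum())  # number of values that failed to parse
```

Counting the `NaT` or `NaN` values after coercion is a quick way to see how many entries were invalid before deciding how to handle them.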


## Business Scenario 1: Passenger Demographics

**Goal:** Understand the customer base to tailor safety and comfort services.

**Questions:**

1. What is the **average age** of passengers on the Titanic?
2. What **percentage of passengers** were **male vs female**?
3. Which **port (Embarked)** contributed the **most passengers**?
4. How many passengers were **children (age < 18)**, **adults (18-60)**, and **seniors (60+)**?

---

## Business Scenario 2: Survival Analysis

**Goal:** Identify which groups were more likely to survive to improve future safety protocols.

**Questions:**

1. What was the **overall survival rate**?
2. Which **gender** had a higher survival rate?
3. How did survival rates differ by **passenger class (Pclass)**?
4. Did **children** have better survival chances than adults?
5. Which **port of embarkation (Embarked)** had the highest survival rate?

---

## Business Scenario 3: Revenue Insights

**Goal:** Understand ticket pricing patterns and potential revenue drivers.

**Questions:**

1. What was the **average fare per passenger class**?
2. Is there a correlation between **fare** and **survival**?
3. Which **port** generated the **highest total fare revenue**?
4. Identify **top 10 paying customers** (by fare).
5. Did **family size** (SibSp + Parch) affect fare price or survival?

---

## Business Scenario 4: Customer Segmentation

**Goal:** Segment passengers to target different customer profiles.

**Questions:**

1. Create a new column `FamilySize = SibSp + Parch + 1`.

   * How many passengers traveled **alone** vs **with family**?
2. Compare **survival rates** between solo travelers and those with family.
3. Find the **average fare and survival rate** for each combination of (`Pclass`, `Sex`).
4. Cluster passengers into groups based on **Age**, **Fare**, and **Pclass** (optional, for advanced users).

---

## Business Scenario 5: Predictive Indicators (Exploratory)

**Goal:** Identify key variables influencing survival for a predictive model.

**Questions:**

1. Which features show the **strongest correlation** with survival?
2. Create a pivot table showing **mean survival rate** by `Pclass` and `Sex`.
3. Which combination of `Pclass` and `Embarked` had the **lowest survival rate**?
4. If the company were to build a new ship, what **demographic group** should they ensure receives better safety measures?

---

## Bonus: Data Cleaning & Preparation Tasks

To mimic real-world work, try:

1. Handle **missing values** in `Age`, `Cabin`, and `Embarked`.
2. Create **bins** for age groups (Child, Teen, Adult, Senior).
3. Extract **titles** (Mr, Mrs, Miss, etc.) from the `Name` column, then analyze survival by title.
4. Encode categorical columns (`Sex`, `Embarked`) for modeling.


In [None]:
# merge
# join
# concat /vstack/hstack
# pivot
# unpivot/melt
# convert to json /json dataframe
# cut

### 1. Merge -->table join

In [None]:
# student details table
student = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 95]
})

# subject details table
sub = pd.DataFrame({
    'ID': [2, 3, 4],
    'Subject': ['Math', 'English', 'Science'],
    'Grade': ['A', 'B', 'A'],
    'Name': ['Alice', 'Bob', 'Charlie']
})
print(student)
print("==============")
print(sub)

In [None]:
# merge is a top-level pandas function
# syntax
# pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))
# SQL equivalent: select * from t1 join t2 on t1.col1 = t2.col1
pd.merge(student, sub, on='ID', how='inner', suffixes=("_std", "_sub"))
# left join: keep every row of student
merged = pd.merge(student, sub, on='ID', how='left', suffixes=('_std', '_sub'))
print(merged)
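To see which rows each join type keeps, an outer join with `indicator=True` is a handy diagnostic. The sketch below uses two small hypothetical frames, `left` and `right`, that mirror the ID columns of the tables above.

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"ID": [2, 3, 4], "Subject": ["Math", "English", "Science"]})

# how="outer" keeps every ID; indicator=True adds a "_merge" column
# saying whether each row came from the left frame, the right frame, or both
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)
print(outer)

inner = pd.merge(left, right, on="ID", how="inner")
print(len(inner), len(outer))
```

Only the rows marked `both` survive an inner join; a left join keeps `both` and `left_only`.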

In [None]:
# 2. join is called on a DataFrame --> by default it matches rows on the index
student.join(sub,how='left',lsuffix='_std', rsuffix='_sub')
# rows are matched by index value here, not by the ID column

In [None]:
# Join: Using set_index() and join()
df1 = student.set_index('ID')
df2 = sub.set_index('ID')
df2
joined_df = df1.join(df2, how='left',lsuffix='_std', rsuffix='_sub')
print(joined_df)

### Concatenation // vstack/hstack

In [None]:
pd.concat([student, sub], axis=0, ignore_index=True)
#

In [None]:
pd.concat([student, sub], axis=1)
#
# pd.concat([df1, df2], axis=1) (hstack)
# pd.concat([df1, df2], axis=0) (vstack)

# import numpy as np
# np.stack([df1, df2], axis=1) ## all inputs must have the same shape
# np.concatenate([df1, df2], axis=1)

### Pivot table

In [None]:
# filter, row, column ,values
# pivot table
pivot_data = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Subject': ['Math', 'Science', 'Math', 'Science'],
    'Score': [88, 92, 80, 85]
})
pivot_data

In [None]:
pivot_table= pivot_data.pivot(index='ID', columns='Subject', values='Score')
# index =row, columns =column ,values =value
print(pivot_table)
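`pivot()` only reshapes, so it fails when an index/column pair appears more than once. `pivot_table()` aggregates such duplicates instead. A minimal sketch with a hypothetical `attempts` frame in which student 1 has two Math scores:

```python
import pandas as pd

# Hypothetical data with a repeated (ID, Subject) pair
attempts = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Subject": ["Math", "Math", "Science", "Math"],
    "Score": [80, 90, 92, 70],
})

try:
    attempts.pivot(index="ID", columns="Subject", values="Score")
    pivot_failed = False
except ValueError:
    pivot_failed = True   # pivot() cannot reshape duplicate index/column pairs

# pivot_table() resolves duplicates with an aggregation function
table = attempts.pivot_table(index="ID", columns="Subject", values="Score", aggfunc="mean")
print(table)
```

Combinations with no data, such as student 2 in Science, appear as `NaN` in the result.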

In [None]:
# unpivot /melt
d = pivot_table.reset_index()
pd.melt(d, id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score')
#

In [None]:
# Definitions and Use Cases:
# ------------------------------------------------------------
# concat: Combines DataFrames either vertically (row-wise) or horizontally (column-wise).
#   Use case: Combine data from multiple CSVs or append new data.
# merge: Combines DataFrames based on common columns (similar to SQL joins).
#   Use case: Merge customer info with transaction data.
# join: Joins DataFrames using their index.
#   Use case: Combine metadata indexed by unique IDs.
# pivot: Reshapes data by turning unique values in a column into new columns.
#   Use case: Create summary tables like Excel pivot tables.
# melt (unpivot): Converts wide-format data into long-format.
#   Use case: Prepare data for analysis/visualization by tidying it.

In [None]:
# cut method
# used for segmenting and sorting data values into discrete intervals,
# or "bins." It is particularly useful for transforming a continuous numerical variable into a categorical variable.
# Create a sample DataFrame
data = {'Score': [65, 75, 88, 92, 55, 70, 83, 95]}
df_scores = pd.DataFrame(data)

# Define bins and labels
bins = [0, 60, 70, 80, 90, 100]
labels = ['Fail', 'Pass', 'Good', 'Very Good', 'Excellent']

# Use pd.cut to categorize scores
df_scores['Category'] = pd.cut(df_scores['Score'], bins=bins, labels=labels, right=True)

print(df_scores)

# DateTime

In [None]:
# DateTime
df = pd.DataFrame({"date": ['2024-02-08', '2024-02-09', '2024-02-10']})
df.dtypes
# change the data types
df['date'] = pd.to_datetime(df['date'])
df.dtypes
df["date"].dt.month

#
df["months"] =  df["date"].dt.month
df["days"] =  df["date"].dt.day
df["year"] =  df["date"].dt.year
df
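Once a column has a datetime type, the `.dt` accessor and date arithmetic become available. A short sketch using the same three dates with a hypothetical `sales` column:

```python
import pandas as pd

# Hypothetical daily sales values for the three dates used above
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-02-08", "2024-02-09", "2024-02-10"]),
    "sales": [10, 20, 30],
})

daily["weekday"] = daily["date"].dt.day_name()             # name of the weekday
daily["next_day"] = daily["date"] + pd.Timedelta(days=1)   # date arithmetic
from_9th = daily[daily["date"] >= "2024-02-09"]            # filter rows by date

print(daily)
print(len(from_9th))
```

None of these operations work while the column is still stored as plain strings, which is why the `pd.to_datetime` conversion comes first.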

## dataframe to json

json to dataframe

In [None]:
# dataframe to json
df.to_json("data.json")
#
print(df.to_json())

In [None]:
df.to_json("data.json", orient="records")

print(df.to_json( orient="records"))

In [None]:
# json to dataframe
df1 = pd.read_json("data.json")
df1



# 10. visualize the data

In [None]:
# Visualizations in pandas
df1["date"].plot(kind='line')

### Extra Topics or parameter(Optional)

Why use pd.Categorical?

Categorical data saves memory and makes grouping / sorting / modeling more efficient.

1. Memory efficiency

If you have millions of rows and only 3 unique values (1, 2, 3), pandas stores the distinct values once and keeps a small integer code for each row instead of repeating the full numeric or string value.
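The saving can be measured directly. A rough sketch, assuming a hypothetical column of three string labels repeated many times; exact byte counts depend on the pandas version and string storage, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times
labels = pd.Series(["first", "second", "third"] * 100_000)
as_category = labels.astype("category")

plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

print(as_category.cat.categories.tolist())       # the distinct values, stored once
print(as_category.cat.codes.head(3).tolist())    # small integer codes stored per row
```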

In [None]:
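# pd.Categorical() converts a Pandas Series (or list-like object) into a Categorical data type: a special kind of data structure
#  in pandas that represents discrete fixed values (categories) instead of plain numeric or string data.
# df was overwritten in the DateTime section, so reload the Titanic data first
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
pd.Categorical(df['Pclass'])

# reference vs copy
df3 = df          # not a copy: df3 is just a second name for the same DataFrame
df2 = df.copy()   # independent copy of the DataFrame (deep=True by default)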
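The difference between plain assignment and `copy()` shows up as soon as the original is modified. A minimal sketch with a hypothetical one-column frame:

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})  # hypothetical data

alias = original               # same object under a second name, not a copy
independent = original.copy()  # separate copy (deep=True is the default)

original.loc[0, "value"] = 99

print(alias is original)             # both names refer to one DataFrame
print(alias.loc[0, "value"])         # reflects the change
print(independent.loc[0, "value"])   # unaffected by the change
```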

In [None]:
df.groupby('Survived').mean(numeric_only=True) #for recent pandas version

In [None]:
df.groupby('Survived').describe()

In [None]:
# Sorting by column "Fare" in descending order
df.sort_values(by=['Fare'], ascending=False)

In [None]:
import numpy as np
numeric_columns = df.select_dtypes(include=[np.number]).columns.drop('Survived')  # exclude the grouping column
result = df.groupby('Survived')[numeric_columns].aggregate(['min', 'max', 'mean', 'median', 'count', 'var'])  # string names avoid the warning raised for np.min
result

In [None]:
#Q what is the average fare paid by people who survived?
#groupby >> https://www.w3resource.com/python-exercises/pandas/groupby/index.php

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df
#To convert the result to dataframe
df.groupby(['Sex', 'Pclass'])['Survived'].sum().to_frame()

df.groupby(['Sex', 'Pclass'])['Survived'].sum().unstack()

## hstack,unstack,to_frame


In [None]:
# Sorting by column "Fare" in descending order
df.sort_values(by=['Fare'], ascending=False)
# Sorting by columns "Pclass" and then "Sex"
df.sort_values(by=['Pclass', 'Sex'])
# Sorting by column "Pclass" in descending
# order and then "Sex" in ascending order
df.sort_values(by=['Pclass', 'Sex'],
               ascending=[False, True])


import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')
titanic.head()

## Part 1: Grouping and Aggregation

### Use Case: Find average age of passengers grouped by gender and class

```python
# Group by sex and class, then find average age
grouped = titanic.groupby(['sex', 'class'])['age'].mean().reset_index()
print(grouped)
```

### Use Case: Count passengers in each class

```python
# Count of passengers in each class
class_counts = titanic['class'].value_counts()
print(class_counts)
```

### Use Case: Survival rate by gender

```python
# Mean of survived (1 = survived, 0 = did not survive)
survival_rate = titanic.groupby('sex')['survived'].mean().reset_index()
print(survival_rate)
```

### Use Case: Multiple Aggregations

```python
# Aggregate age with multiple functions
agg_stats = titanic.groupby('class')['age'].agg(['mean', 'min', 'max', 'count']).reset_index()
print(agg_stats)
```

---

## Part 2: Merging and Joining DataFrames

Let's create two example DataFrames from the Titanic dataset for merging and joining.

### Step 1: Create two dataframes

```python
# Selecting relevant columns
df1 = titanic[['survived', 'sex', 'age']].iloc[:10]
df2 = titanic[['age', 'fare']].iloc[:10]

# Add an ID column for joining (the old row index becomes the 'id' column)
df1 = df1.reset_index().rename(columns={'index': 'id'})
df2 = df2.reset_index().rename(columns={'index': 'id'})
```

### Merge: Inner Join on `id`

```python
merged_inner = pd.merge(df1, df2, on='id', how='inner')
print(merged_inner)
```

### Merge: Left Join

```python
merged_left = pd.merge(df1, df2, on='id', how='left')
print(merged_left)
```

### Join: Using `set_index()` and `join()`

```python
df1_indexed = df1.set_index('id')
df2_indexed = df2.set_index('id')

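# both frames contain an 'age' column, so join() needs suffixes to tell them apart
joined_df = df1_indexed.join(df2_indexed, how='inner', lsuffix='_left', rsuffix='_right')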
print(joined_df)
```

---

## Summary

| Task                  | Function                              |
| --------------------- | ------------------------------------- |
| Grouping by column(s) | `groupby()`                           |
| Aggregation           | `agg()`, `mean()`, `sum()`, `count()` |
| Merge two dataframes  | `pd.merge()`                          |
| Join on index         | `df1.join(df2)`                       |

---


## 🔹 Part 3: Concatenation in Pandas

### ✅ What is Concatenation?

**Concatenation** means **combining multiple DataFrames either vertically (row-wise)** or **horizontally (column-wise)**.

The function used is:

```python
pd.concat([df1, df2], axis=0)  # axis=0 stacks rows; axis=1 places DataFrames side by side
```

---

### ▶️ Example 1: **Row-wise Concatenation** (Vertical)

Use Case: Combine the first 5 rows and the next 5 rows of the Titanic dataset.

```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Create two subsets
df_top = titanic.iloc[:5]
df_bottom = titanic.iloc[5:10]

# Concatenate row-wise (axis=0)
df_vertical = pd.concat([df_top, df_bottom], axis=0)
print(df_vertical)
```

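📌 `axis=0` is the default and stacks rows.

✅ Columns are aligned by name. A column that is missing from one DataFrame is filled with `NaN` in the rows that came from it, so matching column names give the cleanest result.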
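
To see how column names affect row-wise stacking, here is a minimal sketch with two made-up DataFrames (`a` and `b`) that share only the `age` column. By default `pd.concat` keeps the union of the columns; `join="inner"` keeps only the columns common to all inputs.

```python
import pandas as pd

# Hypothetical frames with partly different columns
a = pd.DataFrame({"age": [22.0, 38.0], "fare": [7.25, 71.28]})
b = pd.DataFrame({"age": [26.0], "sex": ["female"]})

# Default (outer) behaviour: union of columns, gaps filled with NaN
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# join="inner": only the columns present in every DataFrame survive
shared = pd.concat([a, b], axis=0, join="inner", ignore_index=True)

print(stacked)
print(shared)
```
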

---

### ▶️ Example 2: **Column-wise Concatenation** (Horizontal)

Use Case: Combine two DataFrames side by side.

```python
# Select 5 rows of different columns
df1 = titanic[['survived', 'sex']].iloc[:5]
df2 = titanic[['age', 'fare']].iloc[:5]

# Concatenate column-wise (axis=1)
df_horizontal = pd.concat([df1, df2], axis=1)
print(df_horizontal)
```

📌 `axis=1` places DataFrames **side by side**, like adding new features. Rows are matched on their index labels, so both DataFrames should share the same index.
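
Column-wise concatenation matches rows by index label, not by position. The sketch below uses two made-up DataFrames whose indexes only partly overlap to show the effect, and then one way to force positional alignment by resetting both indexes first.

```python
import pandas as pd

# Hypothetical frames: index labels 0-2 on the left, 1-3 on the right
left = pd.DataFrame({"survived": [0, 1, 1]}, index=[0, 1, 2])
right = pd.DataFrame({"fare": [71.28, 7.92, 53.10]}, index=[1, 2, 3])

# Rows are matched by label; labels missing on one side produce NaN
side_by_side = pd.concat([left, right], axis=1)

# Resetting both indexes aligns the rows purely by position instead
by_position = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(side_by_side)
print(by_position)
```
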

---

### ▶️ Example 3: Concatenation with `ignore_index=True`

Use Case: Reset the index after concatenation.

```python
df_combined = pd.concat([df_top, df_bottom], axis=0, ignore_index=True)
print(df_combined)
```

📌 `ignore_index=True` discards the original row labels and renumbers the rows of the new DataFrame from 0.

---

### 🧠 Summary Table

| Type        | Axis                | Description     |
| ----------- | ------------------- | --------------- |
| Vertical    | `axis=0`            | Appends rows    |
| Horizontal  | `axis=1`            | Appends columns |
| Reset index | `ignore_index=True` | Renumbers rows  |

---


## 🔹 Part 4: Convert Pandas DataFrame to JSON

Pandas makes it easy to convert a DataFrame to JSON format using the `.to_json()` method.

---

### ✅ Syntax

```python
df.to_json(path_or_buf=None, orient=None, lines=False)
```

---

### ▶️ Example 1: Convert Titanic DataFrame to JSON string

```python
import pandas as pd
import seaborn as sns

# Load dataset
titanic = sns.load_dataset('titanic')

# Convert first 5 rows to JSON
json_data = titanic.head().to_json()
print(json_data)
```

---

### ▶️ Example 2: Save DataFrame as JSON file

```python
# Save first 5 rows to a JSON file
titanic.head().to_json("titanic_sample.json", orient="records", lines=True)
```

📂 This creates a JSON Lines file with one record per line, which is useful for large datasets because the file can be processed line by line.

---

### 🔁 Different `orient` options in `.to_json()`

| Orient      | Description                                                 | Output Format                                       |
| ----------- | ----------------------------------------------------------- | --------------------------------------------------- |
| `'split'`   | Dict with index, columns, and data                          | `{"index": [...], "columns": [...], "data": [...]}` |
| `'records'` | List of row dicts                                           | `[{"col1": val1, "col2": val2}, ...]`               |
| `'index'`   | Dict of dicts keyed by index label                          | `{index: {col: val}}`                               |
| `'columns'` | Dict of dicts keyed by column name (default for DataFrames) | `{col: {index: val}}`                               |
| `'values'`  | Just the data as a list of lists                            | `[[...], [...]]`                                    |
| `'table'`   | Data plus a JSON Table Schema describing it                 | `{"schema": {...}, "data": [...]}`                  |
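
To compare the orientations in the table above side by side, the following sketch serializes one tiny made-up DataFrame (`tiny`, not part of the Titanic data) with several `orient` values and collects the parsed results in a dictionary.

```python
import json
import pandas as pd

# Hypothetical two-row frame, small enough to read the output at a glance
tiny = pd.DataFrame({"age": [22, 38], "sex": ["male", "female"]})

# Serialize with each orientation and parse back into Python objects
outputs = {}
for orient in ["split", "records", "index", "columns", "values"]:
    json_text = tiny.to_json(orient=orient)
    outputs[orient] = json.loads(json_text)
    print(f"{orient:8} -> {json_text}")
```

`'records'` is usually the easiest shape for other programs to consume, while `'split'` avoids repeating the column names for every row.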

---

### ▶️ Example 3: Use different `orient`

```python
# Records orientation
json_records = titanic.head().to_json(orient='records')
print(json_records)

# Split orientation
json_split = titanic.head().to_json(orient='split')
print(json_split)
```

---

### 🧠 Summary

| Task                   | Code                                                      |
| ---------------------- | --------------------------------------------------------- |
| Convert to JSON string | `df.to_json()`                                            |
| Save to JSON file      | `df.to_json("file.json")`                                 |
| Pretty JSON            | `df.to_json(indent=4)`, or `json.dumps()` with `indent=4` |
| Control format         | Use `orient='records'`, `'split'`, etc.                   |
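
For the pretty-printing row in the table above, here is a minimal sketch of two ways to get indented JSON. The `tiny` DataFrame is made up for illustration. The first way uses the `indent` parameter of `to_json()` (available in pandas 1.0 and later); the second parses the compact string and re-serializes it with `json.dumps()`.

```python
import json
import pandas as pd

# Hypothetical one-row frame
tiny = pd.DataFrame({"name": ["Alice"], "age": [25]})

# Option 1: let pandas indent the output directly
pretty_pandas = tiny.to_json(orient="records", indent=4)

# Option 2: parse the compact JSON, then pretty-print with the json module
compact = tiny.to_json(orient="records")
pretty = json.dumps(json.loads(compact), indent=4)

print(pretty)
```
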

---


## 🔹 Part 5: Convert JSON to Pandas DataFrame

### ✅ Common Use Case

You receive a **JSON file** (e.g., from a web API or data export) and need to **analyze it using Pandas**.

---

### ▶️ Example 1: Read JSON String

```python
import pandas as pd
from io import StringIO

# Sample JSON string (records orientation)
json_str = '''
[
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Paris"},
    {"name": "Charlie", "age": 28, "city": "London"}
]
'''

# Convert JSON string to DataFrame
# (wrap the string in StringIO; passing a literal JSON string directly is deprecated in recent pandas versions)
df = pd.read_json(StringIO(json_str))
print(df)
```

---

### ▶️ Example 2: Read JSON File (from disk)

```python
# Load JSON file into DataFrame
df_json = pd.read_json('female_survivors.json', lines=True)
print(df_json.head())
```

🔹 `lines=True` is **required** when each line of the file is a separate JSON object (NDJSON, also called JSON Lines).
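
To make the NDJSON format concrete without needing a file on disk, this sketch writes a small made-up DataFrame (`people`) to a JSON Lines string and reads it back through an in-memory `StringIO` buffer. A real file path would work the same way in place of the buffer.

```python
from io import StringIO

import pandas as pd

# Hypothetical data
people = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Export: one JSON object per line (requires orient="records")
ndjson_text = people.to_json(orient="records", lines=True)
print(ndjson_text)

# Import: lines=True tells pandas to parse each line as its own record
restored = pd.read_json(StringIO(ndjson_text), lines=True)
print(restored)
```
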

---

### ▶️ Example 3: Read Nested JSON (Using `json_normalize`)

```python
from pandas import json_normalize

# Sample nested JSON
nested_json = {
    "department": "IT",
    "employees": [
        {"name": "Alice", "role": "Developer"},
        {"name": "Bob", "role": "Manager"}
    ]
}

# Normalize nested list
df_nested = json_normalize(nested_json, 'employees', meta='department')
print(df_nested)
```
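
`json_normalize` also flattens nested dictionaries, not only nested lists. As a small sketch with made-up records, each nested key becomes its own column, named by joining the path with the separator given in `sep`.

```python
import pandas as pd

# Hypothetical records with a nested "address" dictionary
records = [
    {"name": "Alice", "address": {"city": "New York", "zip": "10001"}},
    {"name": "Bob", "address": {"city": "Paris", "zip": "75001"}},
]

# Nested keys become columns such as address_city and address_zip
flat = pd.json_normalize(records, sep="_")
print(flat)
```
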

---

### 🧠 Summary Table

| Task                          | Function                        | Notes                                                            |
| ----------------------------- | ------------------------------- | ---------------------------------------------------------------- |
| JSON string to DataFrame      | `pd.read_json(StringIO(s))`     | Wrap literal strings in `StringIO`; file paths are passed as-is  |
| NDJSON to DataFrame           | `pd.read_json(..., lines=True)` | Each line is a JSON object                                       |
| Nested JSON to flat DataFrame | `json_normalize()`              | Extract nested fields                                            |

---

## 📝 Assignments: JSON → Pandas

### 🎯 **Assignment 1: Basic File Import**

| Task    | Instructions                                              |
| ------- | --------------------------------------------------------- |
| File    | Use `children_passengers.json` (from previous assignment) |
| Load    | Use `pd.read_json()` with `lines=True`                    |
| Display | Show first 5 rows                                         |

---

### 🎯 **Assignment 2: Normalize Nested JSON**

| Task        | Instructions                                                                                                                     |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
| Create JSON | Create a Python dictionary like:<br>`{"team": "Data", "members": [{"name": "A", "skill": "ML"}, {"name": "B", "skill": "SQL"}]}` |
| Convert     | Use `json_normalize()` to extract `members` with `team` as meta                                                                  |
| Display     | Print the DataFrame                                                                                                              |

---


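This section explains four common data manipulation techniques in Pandas: pivoting, melting, grouping, and sorting. The examples assume a DataFrame `df` with the columns `Pclass`, `Sex`, `Age`, and `Fare`.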

### 1. **Pivot the DataFrame**

The `pivot_table()` function reshapes the data by summarizing it. It allows you to group by one or more columns and calculate aggregate values (e.g., mean, sum).

#### Syntax:
```python
pivot_table = df.pivot_table(values='Fare', index='Pclass', columns='Sex', aggfunc='mean')
```

- **`values='Fare'`**: The column we want to aggregate.
- **`index='Pclass'`**: Rows will be grouped by the `Pclass` (Passenger Class) column.
- **`columns='Sex'`**: The unique values in the `Sex` column will become the columns in the resulting table (i.e., Male, Female).
- **`aggfunc='mean'`**: The aggregation function, in this case, calculates the **mean** fare for each combination of `Pclass` and `Sex`.

#### Example Output:
| Sex    | Female | Male |
|--------|--------|------|
| Pclass |        |      |
| 1      | 100.0  | 80.0 |
| 2      | 40.0   | 20.0 |
| 3      | 10.0   | 5.0  |

This gives the **mean fare** for each combination of **Passenger Class (`Pclass`)** and **Sex**.
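
To try the pivot on something runnable, here is a minimal sketch using a small made-up DataFrame (`df_sample`) with the same column names. The fares are invented so the means are easy to check by hand.

```python
import pandas as pd

# Hypothetical sample with the same columns as the example above
df_sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male", "male"],
    "Fare": [80.0, 100.0, 60.0, 10.0, 8.0, 6.0],
})

# One row per Pclass, one column per Sex, each cell holding the mean Fare
fare_by_class_sex = df_sample.pivot_table(
    values="Fare", index="Pclass", columns="Sex", aggfunc="mean"
)
print(fare_by_class_sex)
```
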

### 2. **Melt the DataFrame (Unpivot)**

The `pd.melt()` function reshapes the DataFrame from wide format to long format (unpivot). It is useful when you want to convert multiple columns into rows.

#### Syntax:
```python
df_melted = pd.melt(df, id_vars=['Pclass'], value_vars=['Age', 'Fare'])
```

- **`id_vars=['Pclass']`**: These columns are kept intact.
- **`value_vars=['Age', 'Fare']`**: These columns will be "melted" into one single column of values, creating a long-form DataFrame.

#### Example Output:

| Pclass | variable | value |
|--------|----------|-------|
| 1      | Age      | 22.0  |
| 2      | Age      | 26.0  |
| 1      | Fare     | 71.0  |
| 2      | Fare     | 12.0  |

The new DataFrame is in a **long** format, with each row representing a single observation for `Age` or `Fare`.
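
The following sketch runs `pd.melt()` on a two-row made-up DataFrame (`df_sample`). It shows that the rows for the first melted column come first, followed by the rows for the next one, and that `var_name` and `value_name` can replace the default `variable` and `value` column names. The names `measure` and `amount` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical wide-format sample
df_sample = pd.DataFrame({
    "Pclass": [1, 2],
    "Age": [22.0, 26.0],
    "Fare": [71.0, 12.0],
})

# Default column names: "variable" and "value"
long_df = pd.melt(df_sample, id_vars=["Pclass"], value_vars=["Age", "Fare"])

# Custom names for the two generated columns
long_named = pd.melt(
    df_sample,
    id_vars=["Pclass"],
    value_vars=["Age", "Fare"],
    var_name="measure",
    value_name="amount",
)

print(long_df)
print(long_named)
```
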

### 3. **Group by 'Pclass' and Calculate Mean Fare**

The `groupby()` function groups the DataFrame by one or more columns and allows you to apply aggregation functions like `mean`, `sum`, etc.

#### Syntax:
```python
df_grouped = df.groupby('Pclass')['Fare'].mean()
```

- **`groupby('Pclass')`**: Groups the data by `Pclass` (Passenger Class).
- **`['Fare']`**: Selects the `Fare` column.
- **`.mean()`**: Calculates the **mean fare** for each group.

#### Example Output:

| Pclass | Fare  |
|--------|-------|
| 1      | 84.0  |
| 2      | 20.0  |
| 3      | 13.0  |

This shows the **mean fare** paid by passengers in each class (`Pclass`).

### 4. **Sort by 'Age' Column**

The `sort_values()` function sorts the DataFrame by one or more columns.

#### Syntax:
```python
df_sorted = df.sort_values(by='Age', ascending=False)
```

- **`by='Age'`**: Specifies the column by which to sort the DataFrame (here, `Age`).
- **`ascending=False`**: Sorts the data in **descending** order, so the oldest passengers appear first.

#### Example Output:

| Name  | Age | Fare |
|-------|-----|------|
| John  | 80  | 100  |
| Jane  | 60  | 50   |
| Mark  | 45  | 20   |

This sorts the DataFrame by **Age** in descending order, so the oldest passengers are listed first.
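
Sorting also works on several columns at once, each with its own direction. As a sketch on a made-up DataFrame (`df_sample`, with one missing age), the code below sorts by `Pclass` ascending and then by `Age` descending, keeping rows with a missing age at the end of their group.

```python
import pandas as pd

# Hypothetical sample; Mark's age is missing
df_sample = pd.DataFrame({
    "Name": ["John", "Jane", "Mark", "Ann"],
    "Pclass": [1, 3, 3, 1],
    "Age": [80.0, 60.0, None, 45.0],
})

# Pclass ascending, then Age descending; NaN ages go last
sorted_df = df_sample.sort_values(
    by=["Pclass", "Age"], ascending=[True, False], na_position="last"
)
print(sorted_df)
```
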

---

### Summary:
- **Pivot Table**: Reshapes the DataFrame by summarizing values based on rows and columns.
- **Melt**: Unpivots the DataFrame, converting wide format into long format.
- **GroupBy**: Groups the data by one or more columns and applies aggregation functions (like `mean`).
- **Sort**: Sorts the DataFrame by specified columns in either ascending or descending order.

These operations are fundamental for transforming and analyzing data in Pandas.

```python
# Step 7: Data Transformation
# Pivot the DataFrame
pivot_table = df.pivot_table(values='Fare', index='Pclass', columns='Sex', aggfunc='mean')

# Melt the DataFrame (unpivot)
df_melted = pd.melt(df, id_vars=['Pclass'], value_vars=['Age', 'Fare'])

# Group by 'Pclass' and calculate mean fare
df_grouped = df.groupby('Pclass')['Fare'].mean()

# Sort by 'Age' column
df_sorted = df.sort_values(by='Age', ascending=False)
```


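```python
# Plotting (requires matplotlib)
d = pd.Series([1, 2, 8, 4, 5, 6])
d.plot()
```

Plotting reference: https://pandas.pydata.org/docs/user_guide/visualization.html

```python
# Show up to 1000 characters per column when displaying DataFrames
pd.set_option("display.max_colwidth", 1000)
```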