### Introduction to Pandas

Pandas is a Python library for data manipulation and analysis. It provides a convenient way to clean, transform, and analyze data.

The Pandas library introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy.
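As a minimal sketch of the two structures (the names `ages` and `people` and the sample values are illustrative assumptions, not taken from a real dataset), a Series holds one labeled column of values, a DataFrame holds several, and both expose their data as NumPy arrays:

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([22, 38, 26], index=["a", "b", "c"], name="age")

# A DataFrame is a two-dimensional table; each column is a Series.
people = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.92]})

# The underlying values can be retrieved as NumPy arrays.
age_values = ages.to_numpy()
print(type(age_values))       # NumPy ndarray
print(type(people["age"]))    # pandas Series
print(people.shape)           # (rows, columns)
```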

### What is Pandas Used for?

Pandas is a powerful library generally used for:

- Data Cleaning
- Data Transformation
- Data Analysis
- Machine Learning
- Data Visualization

### Why Use Pandas?

Some of the reasons why we should use Pandas are as follows:

1. **Handle Large Data Efficiently**

   Pandas is designed for handling large datasets. It provides powerful tools that simplify tasks like data filtering, transforming, and merging.

   It also provides built-in functions to work with formats like CSV, JSON, TXT, Excel, and SQL databases.

2. **Tabular Data Representation**

   Pandas DataFrames, the primary data structure of Pandas, handle data in tabular format. This allows easy indexing, selecting, replacing, and slicing of data.

3. **Data Cleaning and Preprocessing**

   Data cleaning and preprocessing are essential steps in the data analysis pipeline, and Pandas provides powerful tools to facilitate these tasks. It has methods for handling missing values, removing duplicates, handling outliers, data normalization, etc.

4. **Time Series Functionality**

   Pandas contains an extensive set of tools for working with dates, times, and time-indexed data, as it was initially developed for financial modeling.

5. **Free and Open-Source**

   Pandas follows the same principles as Python, allowing you to use and distribute Pandas for free, even for commercial use.


### Install Pandas

To install Pandas, you need Python and pip installed on your system. If both are already installed, you can install Pandas by entering the following command in the terminal:

```bash
pip install pandas
```

If the installation completes without any errors, Pandas is now successfully installed on your system. You can start using it in your Python projects by importing the Pandas library.

### Import Pandas in Python

We can import Pandas in Python using the import statement:

```python
import pandas as pd
```


In [1]:
# pip install pandas

In [3]:
import pandas
pandas.__version__

'2.2.3'

In [None]:

#pandas>>data manipulation and data wrangling
# create Series, DataFrame ,indexing, columns, data types, assign, create new columns

# data cleaning
# drop columns,  drop rows, fill missing values, handle outliers, remove duplicates


# data transformation
# pivot, melt, groupby, sort_values, rank, quantile, shift,
# data merging
# join, merge, concat (append was removed in pandas 2.0)
# data analysis
# summary statistics, descriptive statistics, correlation, regression, time series analysis
# data visualization
# plot, scatter plot, bar plot, histogram, box plot, violin plot, heatmap
# data export
# to_csv, to_excel, to_json, to_pickle, to_sql



# A Pandas Series is a one-dimensional labeled array-like object that can hold data of any type.
### Labels

# The labels in a Pandas Series are index numbers by default.
# As in a DataFrame or an array, the index numbers in a Series start from 0.


### Series and dataframe (Indexing & Slicing)
##### Series

In [None]:
# create the Series using a list
import pandas as pd
li = [34,3,534,34,23]
se = pd.Series(li)
se[2]
se[2:4]

In [None]:
# Create a Series and specify labels
se2 = pd.Series([12,32,243,45] , index=['a','b','c','d'])
se2["c"]
# reassign
se2["c"] = 24


## DataFrame

In [None]:

#  data frame
pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=['a','b','c'], columns=['x','y','z'])
df
pd.DataFrame({"x":[1,2,3],"y":[4,5,6],"z":[7,8,9]})

s = pd.Series(list(df['Name'][2:5]), index = ['a', 'b', 'c'])
s
s1 = pd.Series(list(df['Name'][5:8]))
s1
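s + s1  # + aligns on index labels; labels present in only one Series give NaN
pd.concat([s, s1])  # Series.append was removed in pandas 2.0; use pd.concat instead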
type(df)

#data structure>> series, dataframe
#series> 1 dimensional in nature
#dataframe> 2 dimensional in nature , multiple series constitute to form a dataframe

df.dtypes
df.shape
df
pd.Series([2, 3, 4], index = [100, "ajay", 2])
d = pd.DataFrame(pd.Series([2, 3, 4], index = [100, "ajay", 2]))
d[1] = "Anuj"
d
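
A small sketch of how `+` behaves on two Series may help here. The labels and values are made up for illustration. Addition matches elements by index label rather than by position, so only shared labels produce a number:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s1 = pd.Series([10, 20, 30], index=["c", "d", "e"])

# "+" aligns on labels: only "c" exists in both, the rest become NaN.
aligned_sum = s + s1

# To add by position, drop the labels first so both use 0, 1, 2.
positional_sum = s.reset_index(drop=True) + s1.reset_index(drop=True)

# To place one Series after the other, use pd.concat.
stacked = pd.concat([s, s1])

print(aligned_sum)
print(positional_sum)
print(stacked)
```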

#### Load the Data from Files

In [None]:
import pandas as pd
# load different files
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df



In [None]:
# load different files
df = pd.read_csv("services.csv")
ex = pd.read_excel("basic_data_excel.xlsx")
ex
df
df3 = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df3
df3.dtypes
df3.shape
df3.info()
df3[['Sex']]

df3.columns
df4 = df3[['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare']]
df4.to_csv("test.csv", index = False)

# pip install lxml
import lxml
url_df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2015_totals.html")
type(url_df)
len(url_df)
df4 = url_df[0]
df4
df4.head()
df4.shape
df4.info()
df4.to_csv("players.csv", index=False)

json_url = "https://api.github.com/repos/pandas-dev/pandas/issues"
df = pd.read_json(json_url)  # each issue becomes one row

# the 'user' column holds one dictionary per issue
for i in range(len(df)):
    print(df['user'][i]['node_id'])

df = df[['user', 'timeline_url']]

# df.to_csv('json_info.csv')
#Solving the dataset

# mysql connector
import mysql.connector

# Create the connection object (replace the placeholder with your own password)
myconn = mysql.connector.connect(host="127.0.0.1", user="root", passwd="your_password", database="mavenmovies")

# Create the cursor object
cur = myconn.cursor()
try:
    # Read the Actor table
    cur.execute("select * from Actor")

    # Fetch the rows from the cursor object
    result = cur.fetchall()

    # Print the result
    for x in result:
        print(x)
except mysql.connector.Error as err:
    # Report the failure instead of silently ignoring it
    print(err)
finally:
    myconn.close()

In [None]:
# 2d slicing 
df2[["a","c"]][1:]   #but not using slicing in 2d

# loc 
# df.loc[rows , columns]
df2.loc["y"]

# df.iloc[rows , columns]
df2.iloc[1: ,1:3]  # index base slicing 
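
The difference between `loc` and `iloc` can be sketched on a small frame. The `df2` below is a hypothetical stand-in with row labels `x`, `y`, `z` and columns `a`, `b`, `c`, not the frame used elsewhere in these notes. Label slices include the end label, while position slices exclude the stop position:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

by_label = df2.loc["x":"y", "a":"b"]   # label slices include both endpoints
by_position = df2.iloc[0:2, 0:2]       # position slices exclude the stop
single_row = df2.loc["y"]              # one row, returned as a Series

print(by_label)
print(by_position)
print(single_row)
```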


### Data Explore

In [None]:
# ================================== 
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df
df.columns
df.head()
df.tail()
df.dtypes
df.size
df.sample(1)
df.columns
list(df.columns)
df.info()
df.dtypes
df.shape
df.describe() #numerical data

type(df['PassengerId'])  # a single column is a Series

# df.numeric.describe()
df.describe()
df[['PassengerId', 'Survived', 'Pclass']]
df.describe(include = 'object')
df.describe(include = 'all')
df.astype('object').describe()

df.dtypes == 'object'
df.dtypes
df.dtypes[df.dtypes == 'object'].index
df[df.dtypes[df.dtypes == 'object'].index]
df[df.columns[df.dtypes == 'object']]

df[df.dtypes[df.dtypes != 'object'].index].describe()
df[df.columns[df.dtypes != 'object']].describe()
df[df.dtypes[df.dtypes == 'object'].index].describe()


# df.numeric.describe()
df.describe()
df
# Slicing : indexing  by label or position
df[10:100:5]
pd.Categorical(df['Pclass'])
pd.Categorical(df['Cabin'])
df['Cabin'].unique()
df['Cabin'].nunique()
df['Cabin'].value_counts()
df
df.head()
# assignment vs. copy
df3 = df          # no copy: df3 is another name for the same DataFrame
df2 = df.copy()   # independent (deep) copy of the DataFrame

# Slicing name basis
df4.loc[0:4]   # row  slicing
df4.loc[0:4,"Name":"Fare"] # rows and  columns slice
df4.loc[0:4,["Name","Fare"]]

# df.iloc[rows , columns]
df2.iloc[1: ,1:3]  # index base slicing 

# Slicing index
df4.iloc[0:4]   # row  slicing
df4.iloc[0:4,2:6]
df4.iloc[0:4,[2,5]]

# create new cols and add into df4
df4["new col"] =0

#Access df rows #implicit index>internal/integer index and explicit index/named>>define
df[0:100]
df.iloc[0:2] #start from 0 and go to 1
df.loc[0:2] #give me the rows whose name is 0, 1, 2

#loc will go with named indexes, iloc will go with inbuilt index
df.iloc[0:2, ['Name', 'Sex', 'Age']] #it will throw an error, why?
df.loc[0:2, ['Name', 'Sex', 'Age']]

df
df.iloc[0:2, 3:6]
list(df['Name'][2:5])


df1[['name']].dropna(axis = 1) #for one column
df1.fillna("missing_value_here>lalalalalala", inplace = True)
df1

In [None]:
# replace value example
df.loc[7, 'Duration'] = 45

# If the value is higher than 120, set it to 120:

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
# Delete rows where "Duration" is higher than 120:

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
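
The same two operations can usually be written without an explicit loop. The following is a sketch on a made-up `workouts` frame (the name and values are assumptions for illustration), using `clip` to cap values and a boolean mask to drop rows:

```python
import pandas as pd

workouts = pd.DataFrame({"Duration": [60, 450, 45, 120, 300]})

# Cap every value above 120 at 120.
capped = workouts.copy()
capped["Duration"] = capped["Duration"].clip(upper=120)

# Alternatively, keep only the rows whose value is at most 120.
filtered = workouts[workouts["Duration"] <= 120]

print(capped)
print(filtered)
```

Vectorized operations like these tend to be shorter and faster than row-by-row loops on large frames.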

In [None]:
# Data Cleaning
df4.duplicated()
df4.duplicated().sum()
df4.drop_duplicates()

df4.isna()
df4.isna().sum()
df4.drop(columns=["Unnamed: 0","Cabin"] ,axis=1) 
df4.drop(columns=["Unnamed: 0","Cabin"] ,axis=1,inplace=True)
df4.isnull().sum()

df.drop(1, inplace=True)
df
df.set_index('Name', inplace = True) #time series data>>you will be making date time column as index
df
df.reset_index(inplace = True)
df

# use aggregate functions
df4["Age"].mean()   # mean age, used below as the replacement value

# fill missing Age values with the mean age
df4["Age"] = df4["Age"].fillna(df4["Age"].mean())

df4.dropna(inplace=True)

# change the datatypes
df4["SibSp"].astype("float32")
# data types change
df4["SibSp"] = df4["SibSp"].astype("float32")

In [None]:


# remove duplicates based on column 'A'
df.drop_duplicates(subset=['A'], keep='first', inplace=True)

# rename columns
df.rename(columns={'A': 'Age', 'B': 'Name', 'C': 'Salary'}, inplace=True)

# remove rows with missing values
df.dropna(inplace=True)

#  How to remove columns containing only NaN values?
# check which columns contain only NaN values
columns_with_nan = df.columns[df.isnull().all()]

# drop the columns containing only NaN values
df = df.drop(columns=columns_with_nan)
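
The cleaning steps above can be tried end to end on a tiny frame. This is a sketch with invented data (`raw` and its columns `A`, `B`, `C` are assumptions), chaining the same methods in one pass:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "A": [25, 25, 40, np.nan],
    "B": ["Ann", "Ann", "Raj", "Lee"],
    "C": [np.nan, np.nan, np.nan, np.nan],
})

cleaned = (
    raw.drop(columns=raw.columns[raw.isna().all()])  # drop columns that are entirely NaN
       .drop_duplicates()                             # drop repeated rows
       .rename(columns={"A": "Age", "B": "Name"})     # give columns readable names
)

# Fill the remaining missing ages with the mean age.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
print(cleaned)
```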




In [4]:
# Questions from the dataset

#Q. How many passengers are less than 5 years old
df2.drop(["Unnamed: 0"], inplace=True,axis=1)
# unique value count  of age

df2['Age'].value_counts()

df['Age'] < 5
df[df['Age'] < 5]

len(df[df['Age'] < 5])

#no of passenger >18

len(df[df['Age'] > 18])

#how many passengers are less than 18 years old

len(df) - len(df[df['Age'] > 18])

len(df[df['Age'] <= 18]) #missing value in age column

#Q. How many passengers have paid less than avg fare

df['Fare'].mean()
df[df['Fare']<df['Fare'].mean()]
len(df[df['Fare'] < df['Fare'].mean()])
#How many passengers paid 0 fare
list(df[df['Fare'] == 0].Name)

#Qhow many passengers are male and female
len(df[df['Sex'] == "male"])
len(df[df['Sex'] == "female"])
df['Sex'].value_counts(normalize = True)

#Q how many passengers of class 1
df[df['Pclass'] == 1]

#How many passengers survived
df[df['Survived'] == 1]
df['Survived'].value_counts(normalize = True)

#How many females paid more than avg fare
df['Sex'] == 'female'
df['Fare'].mean()
df[(df['Sex'] == 'female') & (df['Fare'] > df['Fare'].mean())]

#Q how many passengers are male or paid more than avg fare >>or
df[(df['Sex'] == 'male') | (df['Fare'] > df['Fare'].mean())]
#Q how many male passengers paid more than avg fare >>and
df[(df['Sex'] == 'male') & (df['Fare'] > df['Fare'].mean())]

import numpy as np
np.mean(df['Fare'])
df['Fare'].mean()
max(df['Fare'])
min(df['Fare'])

#who are the passengers who paid maximum fare
df[df['Fare'] == max(df['Fare'])]['Name']

# Q. How many passenger have parch greater than 3
# Q. How many passenger who survived paid the maximum fare
# Q. How many passengers who didnt survived was from class 1
# Q. How many passengers are children(<5 years old)
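
Missing values affect these counts, because a comparison with `NaN` is always `False`. A short sketch on invented ages (the `passengers` frame is an assumption, not the Titanic data) shows why subtracting the "over 18" count from the total is not the same as counting "18 or younger":

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({"Age": [4, 30, np.nan, 17, 18, np.nan]})

over_18 = (passengers["Age"] > 18).sum()
at_most_18 = (passengers["Age"] <= 18).sum()
missing = passengers["Age"].isna().sum()

# Rows with a missing age satisfy neither condition.
print(over_18, at_most_18, missing)
print(len(passengers) - over_18)   # also counts the rows with missing ages
```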




In [1]:
# 2. Handle Error Values
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 'error', 92]
})

# Replace 'error' with NaN and convert column to numeric
df['Score'] = pd.to_numeric(df['Score'], errors='coerce')

In [None]:
# Rename columns
df.rename(columns={'A': 'Age'}, inplace=True)

In [7]:

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df
df1=df.copy()

# group by for single columns
# (the Titanic data has no "customer segment" column, so "Pclass" is used here)
df.groupby(["Pclass"])["Fare"].sum()

df.groupby('Survived').mean(numeric_only=True) #for recent pandas version


| Survived | PassengerId | Pclass   | Age       | SibSp    | Parch    | Fare      |
| -------- | ----------- | -------- | --------- | -------- | -------- | --------- |
| 0        | 447.016393  | 2.531876 | 30.626179 | 0.553734 | 0.32969  | 22.117887 |
| 1        | 444.368421  | 1.950292 | 28.34369  | 0.473684 | 0.464912 | 48.395408 |


In [None]:
# the Titanic data has no "customer segment" column, so "Pclass" is used here
df.groupby(["Sex","Pclass"])["Fare"].sum()

In [None]:
# the Titanic data has no "customer segment" column, so "Pclass" is used here
df.groupby(["Pclass"])["Fare"].aggregate(["mean","count","min","max","sum"])

In [8]:
df.groupby('Survived').sum()

| Survived | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| -------- | ----------- | ------ | ---- | --- | --- | ----- | ----- | ------ | ---- | ----- | -------- |
| 0 | 245412 | 1390 | Braund, Mr. Owen HarrisAllen, Mr. William Henr... | malemalemalemalemalemalemalefemalemalefemalema... | 12985.5 | 304 | 181 | A/5 2117137345033087717463349909A/5. 215134708... | 12142.7199 | E46C23 C25 C27B30C83F G73E31A5D26C110B58 B60D2... | SSQSSSSSQSSSCSSCSCSSSSSCSQCSSSCCSCSSCSSSSSCSSS... |
| 1 | 151974 | 667 | Cumings, Mrs. John Bradley (Florence Briggs Th... | femalefemalefemalefemalefemalefemalefemalefema... | 8219.67 | 162 | 159 | PC 17599STON/O2. 3101282113803347742237736PP 9... | 16551.2294 | C85C123G6C103D56A6B78D33C52B28F33C23 C25 C27D1... | CSSSCSSSSCSQSSQCQCCCQQCSSSSCSSSSSSQSSSCSSSQSCS... |


In [None]:
df.groupby('Survived').describe()
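# restrict to numeric columns first: 'mean', 'median' and 'var' fail on text columns
df.select_dtypes('number').groupby('Survived').aggregate(['min', 'max', 'mean', 'median', 'count', 'var'])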


In [None]:
# Sorting by column "Fare" in descending order
df.sort_values(by=['Fare'], ascending=False)


In [18]:
import numpy as np
numeric_columns = df.select_dtypes(include=[np.number]).columns
# pass 'min' as a string; passing np.min triggers a FutureWarning in recent pandas
result = df.groupby('Survived')[numeric_columns].aggregate(['min', 'max', 'mean', 'median', 'count', 'var'])
result


| Survived | PassengerId (min) | PassengerId (max) | PassengerId (mean) | PassengerId (median) | PassengerId (count) | PassengerId (var) | Survived (min) | Survived (max) | Survived (mean) | Survived (median) | ... | Parch (mean) | Parch (median) | Parch (count) | Parch (var) | Fare (min) | Fare (max) | Fare (mean) | Fare (median) | Fare (count) | Fare (var) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 891 | 447.016393 | 455.0 | 549 | 67933.45411 | 0 | 0 | 0.0 | 0.0 | ... | 0.32969 | 0.0 | 549 | 0.677602 | 0.0 | 263.0 | 22.117887 | 10.5 | 549 | 985.219509 |
| 1 | 2 | 890 | 444.368421 | 439.5 | 342 | 63684.984102 | 1 | 1 | 1.0 | 1.0 | ... | 0.464912 | 0.0 | 342 | 0.595539 | 0.0 | 512.3292 | 48.395408 | 26.0 | 342 | 4435.160158 |


In [None]:
#Q what is the average fare paid by people who survived?
#groupby >> https://www.w3resource.com/python-exercises/pandas/groupby/index.php

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df
df1=df.copy()
df.groupby('Survived').mean(numeric_only=True) #for recent pandas version

df.groupby('Survived').min()
df.groupby('Survived').sum()
df.groupby('Survived').describe()
# restrict to numeric columns: 'mean', 'median', 'std' and 'var' fail on text columns
df.select_dtypes('number').groupby('Survived').aggregate(['min', 'max', 'mean', 'median', 'count', 'std', 'var'])

#groupby with two columns
#Q. Total survivors for each sex, pclass
df1.groupby(['Sex', 'Pclass'])['Survived'].sum()

#To convert the result to dataframe
df.groupby(['Sex', 'Pclass'])['Survived'].sum().to_frame()

df.groupby(['Sex', 'Pclass'])['Survived'].sum().unstack()
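
To see what the two-column grouping and `unstack()` produce, here is a sketch on a five-row frame with made-up values (`sample` is an assumption that only borrows the Titanic column names):

```python
import pandas as pd

sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 1],
    "Survived": [0, 1, 1, 0, 1],
})

# One value per (Sex, Pclass) pair, stored in a Series with a two-level index.
survivors = sample.groupby(["Sex", "Pclass"])["Survived"].sum()

# unstack() moves the inner index level (Pclass) into the columns.
survivors_table = survivors.unstack()

# Several aggregations at once for a single grouping column.
rates = sample.groupby("Sex")["Survived"].agg(["sum", "count", "mean"])

print(survivors)
print(survivors_table)
print(rates)
```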

In [None]:
df1 = df.groupby('Pclass').sum(numeric_only = True)
df1
df1.T
df1.transpose()

df_1 = df[['Name', 'Sex', 'Age']][0:5]
df_1
df_2 =  df[['Name', 'Sex', 'Age']][5:10]
df_2.reset_index(drop = True, inplace = True)
result = pd.concat([df_1, df_2], axis = 0)
result
pd.concat([df_1, df_2], axis = 1)

## Apply functions

In [None]:
# pd.concat
#apply
df['len_name'] = df['Name'].apply(len)

df
# convert dollars to rupees
def convert(x):
    return x*90

df['Fare_1'] = df['Fare'].apply(convert)

#  ============================
def create_flag(x):
    if x < 10:
        return "cheap"
    elif x < 20:
        return "medium"
    else:
        return "high"

df["flag_fare"] = df['Fare'].apply(create_flag)

df

df1.set_index('c', inplace = True)
df.sort_values(by = "Fare")

# # Sorting by column "Population"
df.sort_values(by=['Population'], ascending=False)
# Sorting by columns "Country" and then "Continent"
df.sort_values(by=['Country', 'Continent'])
# Sorting by columns "Country" in descending
# order and then "Continent" in ascending order
df.sort_values(by=['Country', 'Continent'],
               ascending=[False, True])
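
For range-based labels such as the fare flag, `pd.cut` is a possible alternative to `apply` with a hand-written function. The sketch below uses invented fares (the `fares` frame is an assumption) and the same thresholds of 10 and 20. One difference to keep in mind: `pd.cut` leaves missing fares as `NaN`, whereas the function above would label them "high".

```python
import numpy as np
import pandas as pd

fares = pd.DataFrame({"Fare": [7.25, 10.0, 19.99, 20.0, 71.28]})

# right=False makes each bin include its left edge: [-inf, 10), [10, 20), [20, inf)
fares["flag_fare"] = pd.cut(
    fares["Fare"],
    bins=[-np.inf, 10, 20, np.inf],
    right=False,
    labels=["cheap", "medium", "high"],
)
print(fares)
```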


## grouping, aggregation, merging, and **joining**

import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')
titanic.head()

## 🔹 Part 1: Grouping and Aggregation

### ✅ Use Case: Find average age of passengers grouped by gender and class

```python
# Group by sex and class, then find average age
grouped = titanic.groupby(['sex', 'class'])['age'].mean().reset_index()
print(grouped)
```

### ✅ Use Case: Count passengers in each class

```python
# Count of passengers in each class
class_counts = titanic['class'].value_counts()
print(class_counts)
```

### ✅ Use Case: Survival rate by gender

```python
# Mean of survived (1 = survived, 0 = did not survive)
survival_rate = titanic.groupby('sex')['survived'].mean().reset_index()
print(survival_rate)
```

### ✅ Use Case: Multiple Aggregations

```python
# Aggregate age with multiple functions
agg_stats = titanic.groupby('class')['age'].agg(['mean', 'min', 'max', 'count']).reset_index()
print(agg_stats)
```

---

## 🔹 Part 2: Merging and Joining DataFrames

Let’s create two example DataFrames from the Titanic dataset for merging and joining.

### ✅ Step 1: Create two dataframes

```python
# Selecting relevant columns
df1 = titanic[['survived', 'sex', 'age']].iloc[:10]
df2 = titanic[['age', 'fare']].iloc[:10]

# Add an ID column for joining
df1 = df1.reset_index().rename(columns={'index': 'id'})
df2 = df2.reset_index().rename(columns={'index': 'id'})
```

### ✅ Merge: Inner Join on `id`

```python
merged_inner = pd.merge(df1, df2, on='id', how='inner')
print(merged_inner)
```

### ✅ Merge: Left Join

```python
merged_left = pd.merge(df1, df2, on='id', how='left')
print(merged_left)
```

### ✅ Join: Using `set_index()` and `join()`

```python
df1_indexed = df1.set_index('id')
df2_indexed = df2.set_index('id')

joined_df = df1_indexed.join(df2_indexed, how='inner')
print(joined_df)
```

---

## 🧠 Summary

| Task                  | Function                              |
| --------------------- | ------------------------------------- |
| Grouping by column(s) | `groupby()`                           |
| Aggregation           | `agg()`, `mean()`, `sum()`, `count()` |
| Merge two dataframes  | `pd.merge()`                          |
| Join on index         | `df1.join(df2)`                       |
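
When it is unclear which rows matched in a merge, the `indicator=True` option of `pd.merge()` can help. This is a sketch with two small hypothetical frames (`left` and `right` are invented for illustration) rather than the Titanic subsets above:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "sex": ["male", "female", "female"]})
right = pd.DataFrame({"id": [2, 3, 4], "fare": [71.28, 7.92, 53.10]})

# Inner join keeps only ids present in both frames.
inner = pd.merge(left, right, on="id", how="inner")

# Outer join keeps every id; the _merge column records where each row came from.
outer = pd.merge(left, right, on="id", how="outer", indicator=True)

print(inner)
print(outer)
```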

---



## concatenation

In [None]:
# Definitions and Use Cases:
# ------------------------------------------------------------
# concat: Combines DataFrames either vertically (row-wise) or horizontally (column-wise).
#   Use case: Combine data from multiple CSVs or append new data.
# merge: Combines DataFrames based on common columns (similar to SQL joins).
#   Use case: Merge customer info with transaction data.
# join: Joins DataFrames using their index.
#   Use case: Combine metadata indexed by unique IDs.
# pivot: Reshapes data by turning unique values in a column into new columns.
#   Use case: Create summary tables like Excel pivot tables.
# melt (unpivot): Converts wide-format data into long-format.
#   Use case: Prepare data for analysis/visualization by tidying it.

In [None]:

### --- Part 0: Basic df1 and df2 for Join, Merge, Concat, Pivot, Melt --- ###

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 95]
})

df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Subject': ['Math', 'English', 'Science'],
    'Grade': ['A', 'B', 'A']
})

print("\ndf1:")
print(df1)
print("\ndf2:")
print(df2)

In [None]:
### --- Concat Example --- ###

concat_rows = pd.concat([df1, df1], axis=0, ignore_index=True)
concat_cols = pd.concat([df1, df2], axis=1)
print("\nConcat Rows:")
print(concat_rows)
print("\nConcat Columns:")
print(concat_cols)

### --- Merge Example --- ###

merged_inner = pd.merge(df1, df2, on='ID', how='inner')
merged_outer = pd.merge(df1, df2, on='ID', how='outer')
print("\nInner Merge:")
print(merged_inner)
print("\nOuter Merge:")
print(merged_outer)

### --- Join Example --- ###

df1_join = df1.set_index('ID')
df2_join = df2.set_index('ID')
joined_df = df1_join.join(df2_join, how='outer')
print("\nJoin on Index:")
print(joined_df)

### --- Pivot Example --- ###
pivot_data = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Subject': ['Math', 'Science', 'Math', 'Science'],
    'Score': [88, 92, 80, 85]
})
pivot_table = pivot_data.pivot(index='ID', columns='Subject', values='Score')
print("\nPivot Table:")
print(pivot_table)

### --- Melt (Unpivot) Example --- ###

melted = pd.melt(pivot_table.reset_index(), id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score')
print("\nMelted Data:")
print(melted)


---

## 🔹 Part 3: Concatenation in Pandas

### ✅ What is Concatenation?

**Concatenation** means **combining multiple DataFrames either vertically (row-wise)** or **horizontally (column-wise)**.

The function used is:

```python
pd.concat([df1, df2], axis=0)  # axis=0 stacks rows; use axis=1 to place DataFrames side by side
```

---

### ▶️ Example 1: **Row-wise Concatenation** (Vertical)

Use Case: Combine first 5 rows and next 5 rows of Titanic dataset.

```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Create two subsets
df_top = titanic.iloc[:5]
df_bottom = titanic.iloc[5:10]

# Concatenate row-wise (axis=0)
df_vertical = pd.concat([df_top, df_bottom], axis=0)
print(df_vertical)
```

📌 Default is `axis=0`, which stacks rows.
✅ Columns are aligned by name; a column missing from one DataFrame is filled with `NaN` in that DataFrame's rows.
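
What happens when the column names differ can be checked with a short sketch. The frames `top` and `bottom` are invented for illustration. By default `pd.concat` keeps the union of columns and fills the gaps with `NaN`, while `join="inner"` keeps only the shared columns:

```python
import pandas as pd

top = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28]})
bottom = pd.DataFrame({"age": [26, 35], "sex": ["female", "male"]})

# Default (outer) behaviour: union of columns, missing cells become NaN.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# join="inner": only columns present in every frame are kept.
shared_only = pd.concat([top, bottom], axis=0, join="inner", ignore_index=True)

print(stacked)
print(shared_only)
```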

---

### ▶️ Example 2: **Column-wise Concatenation** (Horizontal)

Use Case: Combine two DataFrames side by side.

```python
# Select 5 rows of different columns
df1 = titanic[['survived', 'sex']].iloc[:5]
df2 = titanic[['age', 'fare']].iloc[:5]

# Concatenate column-wise (axis=1)
df_horizontal = pd.concat([df1, df2], axis=1)
print(df_horizontal)
```

📌 `axis=1` joins DataFrames **side-by-side**, like adding new features.

---

### ▶️ Example 3: Concatenation with `ignore_index=True`

Use Case: Reset the index after concatenation.

```python
df_combined = pd.concat([df_top, df_bottom], axis=0, ignore_index=True)
print(df_combined)
```

📌 `ignore_index=True` resets the row indices in the new DataFrame.

---

### 🧠 Summary Table

| Type        | Axis                | Description     |
| ----------- | ------------------- | --------------- |
| Vertical    | `axis=0`            | Appends rows    |
| Horizontal  | `axis=1`            | Appends columns |
| Reset index | `ignore_index=True` | Renumbers rows  |

---



In [None]:
#date time
df = pd.DataFrame({"date": ['2024-02-08', '2024-02-09', '2024-02-10']})
df
df.dtypes
df['updated_date'] = pd.to_datetime(df['date'])
df
df.dtypes
df['month'] = df['updated_date'].dt.month
df
df['year'] = df['updated_date'].dt.year
df
df['day'] = df['updated_date'].dt.day
df


# Convert the column to datetime dtype in place
df['date'] = pd.to_datetime(df['date'])
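
Real date columns often contain values that cannot be parsed. As a sketch (the `events` frame and its bad value are invented), `errors="coerce"` turns such values into `NaT` instead of raising, and the `.dt` accessor then works on the parsed column:

```python
import pandas as pd

events = pd.DataFrame({"date": ["2024-02-08", "2024-02-09", "not a date"]})

# Unparseable strings become NaT (the datetime equivalent of NaN).
events["parsed"] = pd.to_datetime(events["date"], errors="coerce")

# .dt exposes date parts; rows with NaT yield missing values.
events["weekday"] = events["parsed"].dt.day_name()
events["month"] = events["parsed"].dt.month
print(events)
```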


In [None]:
# make a DataFrame from a dictionary of lists
df1 = {"key1": [2, 3, 4, 5],
    "key2": [4, 5, 6, 7],
    "key3": [2, 3, 4, 5]}

df1
df1 = pd.DataFrame(df1)
df1
# make a DataFrame from a dictionary of tuples
df2= {"key1": (2, 3, 4, 5),
    "key2": (4, 5, 6, 7),
    "key3": (2, 3, 4, 5)}
df2
df2 = pd.DataFrame(df2)
df2

pd.merge(df1, df2, how = 'left')

#merge
pd.merge(df1, df2, how = 'right')
# right_on must name a column that exists in df2 (there is no 'key4')
pd.merge(df1, df2, how = 'left', left_on = 'key2', right_on = 'key2')

In [None]:
# This is useful for categorizing passengers
# by age groups, which can be helpful for further analysis (e.g., checking survival rates by age group).
df['Age_Group'] = pd.cut(df['Age'], 
                         bins=[0, 12, 18, 60, 80], 
                         labels=['Child', 'Teen', 'Adult', 'Senior'])
# pd.cut() function, which segments the data into bins (ranges) and labels those
# bins with categorical names (e.g., 'Child', 'Teen', 'Adult', 'Senior') based on the values in the Age column.


# Further Analysis:
# You can now analyze data based on these groups.
df['Age_Group'].value_counts()
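
As a follow-up sketch of the "further analysis" step, the age groups can be combined with `groupby` to get a survival rate per group. The six rows below are invented for illustration and are not Titanic records. Because the bins are right-inclusive by default, an age of exactly 12 falls in `Child`, and an age of exactly 0 would fall outside the first bin and become `NaN`.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [4, 12, 15, 30, 61, 80],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Bins are (0, 12], (12, 18], (18, 60], (60, 80].
sample["Age_Group"] = pd.cut(
    sample["Age"],
    bins=[0, 12, 18, 60, 80],
    labels=["Child", "Teen", "Adult", "Senior"],
)

# observed=True limits the result to groups that actually occur in the data.
rate_by_group = sample.groupby("Age_Group", observed=True)["Survived"].mean()
print(rate_by_group)
```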

### 🔹 Part 4: Convert Pandas DataFrame to JSON

Pandas makes it easy to convert a DataFrame to JSON format using the `.to_json()` method.

---

### ✅ Syntax

```python
df.to_json(path_or_buf=None, orient=None, lines=False)
```

---

### ▶️ Example 1: Convert Titanic DataFrame to JSON string

```python
import pandas as pd
import seaborn as sns

# Load dataset
titanic = sns.load_dataset('titanic')

# Convert first 5 rows to JSON
json_data = titanic.head().to_json()
print(json_data)
```

---

### ▶️ Example 2: Save DataFrame as JSON file

```python
# Save first 5 rows to a JSON file
titanic.head().to_json("titanic_sample.json", orient="records", lines=True)
```

📂 This will create a JSON file with each record on a new line (useful for large datasets).

---

### 🔁 Different `orient` options in `.to_json()`

| Orient      | Description                    | Output Format                       |
| ----------- | ------------------------------ | ----------------------------------- |
| `'split'`   | Dict with index, columns, data | `{index, columns, data}`            |
| `'records'` | List of row-wise dicts         | `[{"col1":val1, "col2":val2}, ...]` |
| `'index'`   | Dict of dicts (index as key)   | `{index: {col:val}}`                |
| `'columns'` | Dict of columns                | `{col: {index:val}}`                |
| `'values'`  | Just the data as list of lists | `[[...], [...]]`                    |
| `'table'`   | JSON Table Schema              | Complex (used in API)               |

---

### ▶️ Example 3: Use different `orient`

```python
# Records orientation
json_records = titanic.head().to_json(orient='records')
print(json_records)

# Split orientation
json_split = titanic.head().to_json(orient='split')
print(json_split)
```

---

### 🧠 Summary

| Task                   | Code                                    |
| ---------------------- | --------------------------------------- |
| Convert to JSON string | `df.to_json()`                          |
| Save to JSON file      | `df.to_json("file.json")`               |
| Pretty JSON            | Use `json.dumps()` with `indent=4`      |
| Control format         | Use `orient='records'`, `'split'`, etc. |
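
The rows of this table can be exercised together in a short round trip. The sketch below uses a two-row frame with invented values (`frame` is an assumption) and keeps everything in memory with `StringIO`, so no file is written:

```python
import json
from io import StringIO

import pandas as pd

frame = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Compact JSON string, one dict per row.
records = frame.to_json(orient="records")

# Pretty-print by re-serialising with the standard json module.
pretty = json.dumps(json.loads(records), indent=4)

# One JSON object per line (NDJSON).
ndjson = frame.to_json(orient="records", lines=True)

# Read the NDJSON text back into a DataFrame.
restored = pd.read_json(StringIO(ndjson), lines=True)

print(pretty)
print(restored)
```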

---



The reverse of exporting is loading JSON data into a Pandas DataFrame.

---

## 🔹 Part 5: Convert JSON to Pandas DataFrame

### ✅ Common Use Case

You receive a **JSON file** (e.g., from a web API or data export) and need to **analyze it using Pandas**.

---

### ▶️ Example 1: Read JSON String

```python
import pandas as pd
from io import StringIO

# Sample JSON string (records orientation)
json_str = '''
[
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Paris"},
    {"name": "Charlie", "age": 28, "city": "London"}
]
'''

# Convert JSON string to DataFrame
# (wrap the string in StringIO; passing a literal JSON string is deprecated since pandas 2.1)
df = pd.read_json(StringIO(json_str))
print(df)
```

---

### ▶️ Example 2: Read JSON File (from disk)

```python
# Load JSON file into DataFrame
df_json = pd.read_json('female_survivors.json', lines=True)
print(df_json.head())
```

🔹 `lines=True` is **required** if each row is a separate JSON object (NDJSON format).

---

### ▶️ Example 3: Read Nested JSON (Using `json_normalize`)

```python
from pandas import json_normalize

# Sample nested JSON
nested_json = {
    "department": "IT",
    "employees": [
        {"name": "Alice", "role": "Developer"},
        {"name": "Bob", "role": "Manager"}
    ]
}

# Normalize nested list
df_nested = json_normalize(nested_json, 'employees', meta='department')
print(df_nested)
```

---

### 🧠 Summary Table

| Task                          | Function                   | Notes                      |
| ----------------------------- | -------------------------- | -------------------------- |
| JSON string to DataFrame      | `pd.read_json()`           | From file, URL, or `StringIO` |
| NDJSON to DataFrame           | `pd.read_json(lines=True)` | Each line is a JSON object |
| Nested JSON to flat DataFrame | `json_normalize()`         | Extract nested fields      |

---

## 📝 Assignments: JSON → Pandas

### 🎯 **Assignment 1: Basic File Import**

| Task    | Instructions                                              |
| ------- | --------------------------------------------------------- |
| File    | Use `children_passengers.json` (from previous assignment) |
| Load    | Use `pd.read_json()` with `lines=True`                    |
| Display | Show first 5 rows                                         |

---

### 🎯 **Assignment 2: Normalize Nested JSON**

| Task        | Instructions                                                                                                                     |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
| Create JSON | Create a Python dictionary like:<br>`{"team": "Data", "members": [{"name": "A", "skill": "ML"}, {"name": "B", "skill": "SQL"}]}` |
| Convert     | Use `json_normalize()` to extract `members` with `team` as meta                                                                  |
| Display     | Print the DataFrame                                                                                                              |

---



The following sections explain four common data manipulation techniques in Pandas:

### 1. **Pivot the DataFrame**

The `pivot_table()` function reshapes the data by summarizing it. It allows you to group by one or more columns and calculate aggregate values (e.g., mean, sum).

#### Syntax:
```python
pivot_table = df.pivot_table(values='Fare', index='Pclass', columns='Sex', aggfunc='mean')
```

- **`values='Fare'`**: The column we want to aggregate.
- **`index='Pclass'`**: Rows will be grouped by the `Pclass` (Passenger Class) column.
- **`columns='Sex'`**: The unique values in the `Sex` column (`female` and `male`) will become the columns in the resulting table.
- **`aggfunc='mean'`**: The aggregation function, in this case, calculates the **mean** fare for each combination of `Pclass` and `Sex`.

#### Example Output:
| Sex    | Female | Male |
|--------|--------|------|
| Pclass |        |      |
| 1      | 100.0  | 80.0 |
| 2      | 40.0   | 20.0 |
| 3      | 10.0   | 5.0  |

This gives the **mean fare** for each combination of **Passenger Class (`Pclass`)** and **Sex**.
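
A runnable sketch of the same call on a five-row frame may make the reshaping concrete. The values are invented and only the column names follow the Titanic data. The second statement shows that the result matches a `groupby` followed by `unstack`:

```python
import pandas as pd

sample = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male"],
    "Fare": [80.0, 60.0, 10.0, 8.0, 6.0],
})

# Rows: Pclass, columns: Sex, cells: mean Fare.
fare_table = sample.pivot_table(values="Fare", index="Pclass", columns="Sex", aggfunc="mean")

# The same table built with groupby + unstack.
same_via_groupby = sample.groupby(["Pclass", "Sex"])["Fare"].mean().unstack()

print(fare_table)
print(same_via_groupby)
```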

### 2. **Melt the DataFrame (Unpivot)**

The `pd.melt()` function reshapes the DataFrame from wide format to long format (unpivot). It is useful when you want to convert multiple columns into rows.

#### Syntax:
```python
df_melted = pd.melt(df, id_vars=['Pclass'], value_vars=['Age', 'Fare'])
```

- **`id_vars=['Pclass']`**: These columns are kept intact.
- **`value_vars=['Age', 'Fare']`**: These columns will be "melted" into one single column of values, creating a long-form DataFrame.

#### Example Output:

| Pclass | variable | value |
|--------|----------|-------|
| 1      | Age      | 22    |
| 1      | Fare     | 71.0  |
| 2      | Age      | 26    |
| 2      | Fare     | 12.0  |

The new DataFrame is in a **long** format, with each row representing a single observation for `Age` or `Fare`.
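
A small runnable sketch shows the exact shape `pd.melt()` returns. The `wide` frame is invented for illustration. Note that the rows for the first melted column (`Age`) all come before the rows for the second (`Fare`):

```python
import pandas as pd

wide = pd.DataFrame({"Pclass": [1, 3], "Age": [38.0, 22.0], "Fare": [71.28, 7.25]})

# Each (row, melted column) pair becomes one row in the long frame.
long = pd.melt(wide, id_vars=["Pclass"], value_vars=["Age", "Fare"])
print(long)
```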

### 3. **Group by 'Pclass' and Calculate Mean Fare**

The `groupby()` function groups the DataFrame by one or more columns and allows you to apply aggregation functions like `mean`, `sum`, etc.

#### Syntax:
```python
df_grouped = df.groupby('Pclass')['Fare'].mean()
```

- **`groupby('Pclass')`**: Groups the data by `Pclass` (Passenger Class).
- **`['Fare']`**: Selects the `Fare` column.
- **`.mean()`**: Calculates the **mean fare** for each group.

#### Example Output:

| Pclass | Fare  |
|--------|-------|
| 1      | 84.0  |
| 2      | 20.0  |
| 3      | 13.0  |

This shows the **mean fare** paid by passengers in each class (`Pclass`).

### 4. **Sort by 'Age' Column**

The `sort_values()` function sorts the DataFrame by one or more columns.

#### Syntax:
```python
df_sorted = df.sort_values(by='Age', ascending=False)
```

- **`by='Age'`**: Specifies the column by which to sort the DataFrame (here, `Age`).
- **`ascending=False`**: Sorts the data in **descending** order, so the oldest passengers appear first.

#### Example Output:

| Name  | Age | Fare |
|-------|-----|------|
| John  | 80  | 100  |
| Jane  | 60  | 50   |
| Mark  | 45  | 20   |

This sorts the DataFrame by **Age** in descending order, so the oldest passengers are listed first.

---

### Summary:
- **Pivot Table**: Reshapes the DataFrame by summarizing values based on rows and columns.
- **Melt**: Unpivots the DataFrame, converting wide format into long format.
- **GroupBy**: Groups the data by one or more columns and applies aggregation functions (like `mean`).
- **Sort**: Sorts the DataFrame by specified columns in either ascending or descending order.

These operations are fundamental for transforming and analyzing data in Pandas.

In [None]:
# Step 7: Data Transformation
# Pivot the DataFrame
pivot_table = df.pivot_table(values='Fare', index='Pclass', columns='Sex', aggfunc='mean')

# Melt the DataFrame (unpivot)
df_melted = pd.melt(df, id_vars=['Pclass'], value_vars=['Age', 'Fare'])

# Group by 'Pclass' and calculate mean fare
df_grouped = df.groupby('Pclass')['Fare'].mean()

# Sort by 'Age' column
df_sorted = df.sort_values(by='Age', ascending=False)


In [None]:
# plotting
d = pd.Series([1, 2, 8, 4, 5, 6])
d.plot()

https://pandas.pydata.org/docs/user_guide/visualization.html

pd.set_option("display.max_colwidth", 1000)