## Pandas

Data Analysis

Data Analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

The name ‘Pandas’ refers to both “Panel Data” and “Python Data Analysis”.

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

It can read and write data in different formats such as CSV, XML, and JSON, including ZIP-compressed files.

Pandas is a powerful data manipulation library widely used for data analysis and data cleaning.

Pandas Data Structures

Pandas provides two primary data structures:

1. Series

One-dimensional labeled array.

Stores homogeneous (same type) data.

2. DataFrame

Two-dimensional data structure.

Looks like a table with rows and columns.

Panel

A three-dimensional container of data (an older concept that was deprecated and has since been removed from pandas).
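As a quick illustration of the two structures described above, the following minimal sketch builds one of each and inspects its dimensions. The labels and values are made up for the example.

```python
import pandas as pd

# A Series is one-dimensional: values plus an index of labels.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# A DataFrame is two-dimensional: each column is itself a Series.
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]})

print(s.ndim, frame.ndim)         # number of dimensions of each structure
print(frame.shape)                # (rows, columns)
print(type(frame['a']).__name__)  # selecting one column gives a Series
```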

Key Features of Pandas

Pandas allows us to analyze large data sets and draw conclusions from them using statistical methods.

Pandas can clean messy data sets and make them readable and relevant.

Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.

Size mutability: columns can be inserted and deleted from a DataFrame and higher-dimensional objects.

Provides data set merging and joining, flexible reshaping, and pivoting of data sets.

Provides time-series functionality.
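The time-series feature listed above can be sketched as follows. The dates and readings are hypothetical; the example only assumes the standard `pd.date_range` and `Series.resample` functions.

```python
import pandas as pd

# Hypothetical daily readings; the dates and values are made up for illustration.
dates = pd.date_range('2025-01-01', periods=6, freq='D')
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Resample the daily values into 2-day bins and sum each bin.
two_day_totals = readings.resample('2D').sum()
print(two_day_totals)
```

Because the index holds real timestamps, pandas can regroup the data by calendar interval without any manual date arithmetic.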

Pandas provides two types of data structures: Series and DataFrame.

In [1]:
import pandas as pd

Series

A Pandas Series is a one-dimensional, array-like object that can hold data of any type. It is similar to a single column of a table.

In [2]:
# Write a program in python to create a pandas series using list
data = [x for x in range(1,6)]
series = pd.Series(data)
print(f'The series becomes:\n{series}')

The series becomes:
0    1
1    2
2    3
3    4
4    5
dtype: int64


In [3]:
# Write a program in python to create a series using a dictionary
data = {'a':1,'b':2,'c':3}
series_dict = pd.Series(data)
print(f'The series becomes:\n{series_dict}')

The series becomes:
a    1
b    2
c    3
dtype: int64


In [4]:
# Write a program in python to create pandas series using the index given by the user
data = [x for x in range(1,6)]
indexes = ['a','b','c','d','e']
series = pd.Series(data, index=indexes)
print(f'The series becomes:\n{series}')

The series becomes:
a    1
b    2
c    3
d    4
e    5
dtype: int64
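Once a Series has custom labels, its values can be reached either by label or by position. The sketch below rebuilds the same labelled Series and shows both styles using `loc` and `iloc`.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = series.loc['c']         # look up by index label
by_position = series.iloc[2]       # look up by integer position
label_slice = series.loc['b':'d']  # label slices include both endpoints

print(by_label, by_position)
print(label_slice.tolist())
```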


DataFrame

A DataFrame is a two-dimensional labeled data structure in Pandas, similar to a table with rows and columns, used for storing and manipulating structured data.

In [5]:
# Write a program in python to create a dataframe from a dictionary of list
data = {
    'Name':['Subhasish','Muskan','Nibedita'],
    'Age':[24,25,24],
    'city':['Burla','Sambalpur','Paradeep']
}
df = pd.DataFrame(data)
print(f'The dataframe becomes:\n {df}')

The dataframe becomes:
         Name  Age       city
0  Subhasish   24      Burla
1     Muskan   25  Sambalpur
2   Nibedita   24   Paradeep


In [6]:
# Write a program in python to create a dataframe from list of dictionaries
data = [
    {'Name':'Krish','Age':32,'City':'Banglore'},
    {'Name':'Jhon','Age':34,'City':'Banglore'},
    {'Name':'Bappy','Age':32,'City':'Banglore'},
    {'Name':'Jack','Age':32,'City':'Banglore'}
]
df = pd.DataFrame(data)
print(f'The dataframe becomes:\n {df}')
print(f'The type of the dataframe becomes:\n {type(df)}')

The dataframe becomes:
     Name  Age      City
0  Krish   32  Banglore
1   Jhon   34  Banglore
2  Bappy   32  Banglore
3   Jack   32  Banglore
The type of the dataframe becomes:
 <class 'pandas.core.frame.DataFrame'>


In [7]:
# Write a program in python to convert a data frame into a numpy array
import numpy as np
arr = np.array(df)
print(f'The array becomes:\n{arr}')
print(f'The type of array becomes:\n{type(arr)}')
print(f'The shape of the array becomes:\n {arr.shape}')
print(f'The size of the array becomes:\n {arr.size}')

The array becomes:
[['Krish' 32 'Banglore']
 ['Jhon' 34 'Banglore']
 ['Bappy' 32 'Banglore']
 ['Jack' 32 'Banglore']]
The type of array becomes:
<class 'numpy.ndarray'>
The shape of the array becomes:
 (4, 3)
The size of the array becomes:
 12
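Pandas also offers its own `DataFrame.to_numpy()` method for this conversion. The following sketch, on a hypothetical two-row frame, shows how the resulting array's dtype depends on the columns selected.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Krish', 'Jhon'], 'Age': [32, 34]})

# Mixed column types fall back to a generic object array.
mixed = df.to_numpy()

# Selecting only numeric columns keeps a numeric dtype.
ages = df[['Age']].to_numpy()

print(mixed.dtype, mixed.shape)
print(ages.dtype, ages.shape)
```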


Working with CSV Files

Let's create a CSV file containing the records of tensile test specimens from an experiment.

to_csv() : this function writes a DataFrame to a CSV file.

In [8]:
data = {
    "Specimen_ID": ["TSP01","TSP02","TSP03","TSP04","TSP05","TSP06","TSP07","TSP08"],
    "Material": ["Steel","Steel","Aluminum","Aluminum","Copper","Copper","Brass","Brass"],
    "Diameter_mm": [8,8,10,10,6,6,8,8],
    "Gauge_Length_mm": [50,50,50,50,50,50,50,50],
    "Original_Area_mm2": [50.27,50.27,78.54,78.54,28.27,28.27,50.27,50.27],
    "Max_Load_N": [42000,41500,28000,27500,18000,17500,32000,31500],
    "UTS_MPa": [835,825,357,350,637,620,637,625],
    "Yield_Stress_MPa": [540,535,250,245,290,280,420,415],
    "Elongation_percent": [18.5,19.2,22.1,23.0,25.3,26.5,21.8,22.4],
    "Test_Date": ["2025-01-12","2025-01-12","2025-01-15","2025-01-15","2025-01-20","2025-01-20","2025-01-25","2025-01-25"]
}
df = pd.DataFrame(data)
df.to_csv('tensile test.csv',index=False)


read_csv() : this function reads a CSV file into a DataFrame.

In [9]:
# Read the CSV file back and display the records
record = pd.read_csv("tensile test.csv")
record

,Specimen_ID,Material,Diameter_mm,Gauge_Length_mm,Original_Area_mm2,Max_Load_N,UTS_MPa,Yield_Stress_MPa,Elongation_percent,Test_Date
0,TSP01,Steel,8,50,50.27,42000,835,540,18.5,2025-01-12
1,TSP02,Steel,8,50,50.27,41500,825,535,19.2,2025-01-12
2,TSP03,Aluminum,10,50,78.54,28000,357,250,22.1,2025-01-15
3,TSP04,Aluminum,10,50,78.54,27500,350,245,23.0,2025-01-15
4,TSP05,Copper,6,50,28.27,18000,637,290,25.3,2025-01-20
5,TSP06,Copper,6,50,28.27,17500,620,280,26.5,2025-01-20
6,TSP07,Brass,8,50,50.27,32000,637,420,21.8,2025-01-25
7,TSP08,Brass,8,50,50.27,31500,625,415,22.4,2025-01-25
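By default `read_csv` leaves date columns as plain text. A minimal sketch of asking it to parse them is shown here; it uses an in-memory string with two rows and four columns copied from the data above, so it does not depend on the file on disk.

```python
import io
import pandas as pd

# Two rows copied from the tensile test data above, held in memory
# so the sketch does not depend on a file on disk.
csv_text = (
    "Specimen_ID,Material,UTS_MPa,Test_Date\n"
    "TSP01,Steel,835,2025-01-12\n"
    "TSP03,Aluminum,357,2025-01-15\n"
)

# parse_dates asks read_csv to convert the listed columns to datetimes.
record = pd.read_csv(io.StringIO(csv_text), parse_dates=['Test_Date'])

print(record.dtypes)
print(record['Test_Date'].dt.day_name().tolist())
```

With a datetime column, accessors under `.dt` (day names, months, and so on) become available.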


head() : this function returns the first n rows of a DataFrame (five by default).

tail() : this function returns the last n rows of a DataFrame (five by default).

In [10]:
# Write a program in python to display first two rows as well as last two rows of a particular record
print(f'The first two rows becomes:')
record.head(2)


The first two rows becomes:


,Specimen_ID,Material,Diameter_mm,Gauge_Length_mm,Original_Area_mm2,Max_Load_N,UTS_MPa,Yield_Stress_MPa,Elongation_percent,Test_Date
0,TSP01,Steel,8,50,50.27,42000,835,540,18.5,2025-01-12
1,TSP02,Steel,8,50,50.27,41500,825,535,19.2,2025-01-12


In [11]:
print(f'The last two rows becomes:')
record.tail(2)

The last two rows becomes:


,Specimen_ID,Material,Diameter_mm,Gauge_Length_mm,Original_Area_mm2,Max_Load_N,UTS_MPa,Yield_Stress_MPa,Elongation_percent,Test_Date
6,TSP07,Brass,8,50,50.27,32000,637,420,21.8,2025-01-25
7,TSP08,Brass,8,50,50.27,31500,625,415,22.4,2025-01-25


Accessing Data from DataFrame

In [12]:
data = [
    {'Name':'Krish','Age':32,'City':'Banglore'},
    {'Name':'Jhon','Age':34,'City':'Banglore'},
    {'Name':'Bappy','Age':32,'City':'Banglore'},
    {'Name':'Jack','Age':32,'City':'Banglore'}
]
df = pd.DataFrame(data)
print(f'The dataframe becomes:\n {df}')

The dataframe becomes:
     Name  Age      City
0  Krish   32  Banglore
1   Jhon   34  Banglore
2  Bappy   32  Banglore
3   Jack   32  Banglore


In [13]:
# Write a program in python to access names from the dataframe
df['Name']

0    Krish
1     Jhon
2    Bappy
3     Jack
Name: Name, dtype: object

In [14]:
# Write a program in python to access the data available in first row of the record
df.loc[0]

Name       Krish
Age           32
City    Banglore
Name: 0, dtype: object

In [15]:
# Write a program in python to display the age of the person available in first row of the record
df.iloc[0, 1]  # row 0, column 1; avoids positional [] lookup on a labelled Series

32

In [16]:
# Write a program in python to find out the age of the person in the second row of the dataframe
print(f"The age of the particular person becomes:\n {df.at[1,'Age']}")

The age of the particular person becomes:
 34


In [17]:
# Write a program in python to find out the city of the person in the fourth row of the dataframe
print(f"The city of the person becomes:\n {df.iat[3,2]}")

The city of the person becomes:
 Banglore
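The accessors used above can also select several rows or columns at once. As a minimal sketch on the same four-person DataFrame, `loc` accepts a boolean mask and `iloc` accepts positional slices.

```python
import pandas as pd

df = pd.DataFrame([
    {'Name': 'Krish', 'Age': 32, 'City': 'Banglore'},
    {'Name': 'Jhon', 'Age': 34, 'City': 'Banglore'},
    {'Name': 'Bappy', 'Age': 32, 'City': 'Banglore'},
    {'Name': 'Jack', 'Age': 32, 'City': 'Banglore'},
])

# loc selects by label: a row mask plus a column name.
older_names = df.loc[df['Age'] > 32, 'Name']

# iloc selects by position: first two rows, first two columns.
top_left = df.iloc[:2, :2]

print(older_names.tolist())
print(top_left)
```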


Data Manipulation

In [18]:
# Write a program in python to add salary column in the given dataframe
df['Salary'] = [50000,60000,70000,80000]
print(f'The dataframe becomes:')
df

The dataframe becomes:


,Name,Age,City,Salary
0,Krish,32,Banglore,50000
1,Jhon,34,Banglore,60000
2,Bappy,32,Banglore,70000
3,Jack,32,Banglore,80000


In [19]:
# Write a program in python to remove salary column from the dataframe
df.drop('Salary',axis=1,inplace=True)
print(f'The dataframe becomes:')
df

The dataframe becomes:


,Name,Age,City
0,Krish,32,Banglore
1,Jhon,34,Banglore
2,Bappy,32,Banglore
3,Jack,32,Banglore


In [20]:
# Write a program in python to display the dataframe having ages of the person increased by 1
df['Age'] = df['Age'] + 1
print(f'The dataframe becomes:')
df

The dataframe becomes:


,Name,Age,City
0,Krish,33,Banglore
1,Jhon,35,Banglore
2,Bappy,33,Banglore
3,Jack,33,Banglore


In [21]:
# Write a program in python to remove the last record from the dataframe
df.drop(3,inplace=True)
print(f'The dataframe becomes:')
df

The dataframe becomes:


,Name,Age,City
0,Krish,33,Banglore
1,Jhon,35,Banglore
2,Bappy,33,Banglore
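The `drop` calls above modify the DataFrame in place. A minimal sketch of the non-mutating form, on a hypothetical three-row frame, is shown here: without `inplace=True`, `drop` returns a new DataFrame and the original is left untouched.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Krish', 'Jhon', 'Bappy'],
                   'Age': [32, 34, 32],
                   'Salary': [50000, 60000, 70000]})

# Without inplace=True, drop returns a new DataFrame and leaves df unchanged.
without_salary = df.drop(columns='Salary')
without_last = df.drop(index=df.index[-1])

print(list(without_salary.columns))
print(len(without_last), len(df))
```

Keeping the original intact makes it easier to re-run a notebook cell without losing data.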


In [22]:
# Write a program in python to display the datatype of each column of the tensile test data
print(f'The datatype of elements available in each columns becomes:')
record.dtypes

The datatype of elements available in each columns becomes:


Specimen_ID            object
Material               object
Diameter_mm             int64
Gauge_Length_mm         int64
Original_Area_mm2     float64
Max_Load_N              int64
UTS_MPa                 int64
Yield_Stress_MPa        int64
Elongation_percent    float64
Test_Date              object
dtype: object

In [23]:
# Write a program in python to display the statistical summary about the tensile test data
print(f'The statistical summary of tensile test data becomes:')
record.describe()

The statistical summary of tensile test data becomes:


,Diameter_mm,Gauge_Length_mm,Original_Area_mm2,Max_Load_N,UTS_MPa,Yield_Stress_MPa,Elongation_percent
count,8.0,8.0,8.0,8.0,8.0,8.0,8.0
mean,8.0,50.0,51.8375,29750.0,610.75,371.875,22.35
std,1.511858,0.0,19.074026,9200.155278,181.358957,122.560116,2.711352
min,6.0,50.0,28.27,17500.0,350.0,245.0,18.5
25%,7.5,50.0,44.77,25125.0,554.25,272.5,21.15
50%,8.0,50.0,50.27,29750.0,631.0,352.5,22.25
75%,8.5,50.0,57.3375,34375.0,684.0,448.75,23.575
max,10.0,50.0,78.54,42000.0,835.0,540.0,26.5
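Column arithmetic can also be used to cross-check the tensile data. The sketch below assumes the `UTS_MPa` column was obtained as maximum load divided by original area (the usual definition of engineering ultimate tensile strength); the source does not state how it was computed. The column names `UTS_check_MPa` and `Difference_MPa` are introduced here for illustration.

```python
import pandas as pd

# Three specimens copied from the tensile test data above.
record = pd.DataFrame({
    'Specimen_ID': ['TSP01', 'TSP03', 'TSP05'],
    'Original_Area_mm2': [50.27, 78.54, 28.27],
    'Max_Load_N': [42000, 28000, 18000],
    'UTS_MPa': [835, 357, 637],
})

# Stress = load / area; N / mm^2 is numerically equal to MPa.
record['UTS_check_MPa'] = record['Max_Load_N'] / record['Original_Area_mm2']
record['Difference_MPa'] = (record['UTS_check_MPa'] - record['UTS_MPa']).abs()

print(record[['Specimen_ID', 'UTS_MPa', 'UTS_check_MPa', 'Difference_MPa']].round(2))
```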


Data Manipulation and Data Analysis

Data manipulation and analysis are key tasks in any data science or data analysis project. Pandas provides a wide range of functions for data manipulation and analysis, making it easier to clean, transform and extract insights from data.

In [24]:
# Load the dataset from a CSV file
data = pd.read_csv('data.csv')
print(f'The dataset becomes:')
data

The dataset becomes:


,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North
5,2023-01-06,B,54.0,Product3,192.0,West
6,2023-01-07,A,16.0,Product1,936.0,East
7,2023-01-08,C,89.0,Product1,488.0,West
8,2023-01-09,C,37.0,Product3,772.0,West
9,2023-01-10,A,22.0,Product2,834.0,West


In [25]:
# write a program in python to display first 5 rows of the dataset
print(f'The first five rows becomes:')
data.head()

The first five rows becomes:


,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North


In [27]:
# Write a program in python to display the statistical summary of the dataset
print(f'The statistical analysis becomes:\n')
data.describe()

The statistical analysis becomes:



,Value,Sales
count,47.0,46.0
mean,51.744681,557.130435
std,29.050532,274.598584
min,2.0,108.0
25%,27.5,339.0
50%,54.0,591.5
75%,70.0,767.5
max,99.0,992.0


In [30]:
# Write a program in python to see the datatype of the entries in each column
print(f'The datatype of each column becomes:')
data.dtypes

The datatype of each column becomes:


Date         object
Category     object
Value       float64
Product      object
Sales       float64
Region       object
dtype: object

Handling Missing Values

In [None]:
# Write a program in python to find out which columns of the dataset contain missing values
data.isnull().any()

Date        False
Category    False
Value        True
Product     False
Sales        True
Region      False
dtype: bool

In [33]:
# Write a program in python to find out the number of missing values in each column
data.isnull().sum()

Date        0
Category    0
Value       3
Product     0
Sales       4
Region      0
dtype: int64

In [34]:
# Write a program in python to replace all the missing value in the data with 0
data_fillna = data.fillna(0)
data_fillna

,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North
5,2023-01-06,B,54.0,Product3,192.0,West
6,2023-01-07,A,16.0,Product1,936.0,East
7,2023-01-08,C,89.0,Product1,488.0,West
8,2023-01-09,C,37.0,Product3,772.0,West
9,2023-01-10,A,22.0,Product2,834.0,West


In [35]:
# Write a program in python to replace the missing values in the Sales column with the mean of that column
data['Sales'] = data['Sales'].fillna(data['Sales'].mean())
data

,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North
5,2023-01-06,B,54.0,Product3,192.0,West
6,2023-01-07,A,16.0,Product1,936.0,East
7,2023-01-08,C,89.0,Product1,488.0,West
8,2023-01-09,C,37.0,Product3,772.0,West
9,2023-01-10,A,22.0,Product2,834.0,West
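To fill every numeric column with its own mean in one step, `fillna` can be given a Series of per-column means. This is a minimal sketch on a hypothetical three-row frame, not on the `data.csv` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, with one gap in each numeric column.
data = pd.DataFrame({
    'Category': ['A', 'B', 'C'],
    'Value': [10.0, np.nan, 30.0],
    'Sales': [100.0, 200.0, np.nan],
})

# mean(numeric_only=True) gives one mean per numeric column;
# fillna matches those means to the columns by name.
column_means = data.mean(numeric_only=True)
filled = data.fillna(column_means)

print(filled)
```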


In [37]:
# Write a program in python to change the name of the date column
data = data.rename(columns={'Date':'Sales Date'})
data.head()

,Sales Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North


In [45]:
# Write a program in python to change the datatype of a column in the dataframe
# (missing values are filled with the column mean first, because int cannot hold NaN)
data['Value_new'] = data['Value'].fillna(data['Value'].mean()).astype(int)
data.head()

,Sales Date,Category,Value,Product,Sales,Region,Value_new
0,2023-01-01,A,28.0,Product1,754.0,East,28
1,2023-01-02,B,39.0,Product3,110.0,North,39
2,2023-01-03,C,32.0,Product2,398.0,East,32
3,2023-01-04,B,8.0,Product1,522.0,East,8
4,2023-01-05,B,26.0,Product3,869.0,North,26
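Filling before casting is needed because the plain integer dtype has no way to store a missing value. An alternative, sketched here on a hypothetical three-element Series, is the nullable `'Int64'` dtype, which keeps the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

values = pd.Series([28.0, np.nan, 32.0])

# Plain int cannot represent NaN, so the gap must be filled first.
filled_int = values.fillna(0).astype(int)

# The nullable 'Int64' dtype keeps the gap as <NA> instead.
nullable_int = values.astype('Int64')

print(filled_int.tolist())
print(nullable_int)
```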


In [47]:
# Write a program in python to double the values in the Value_new column
# (re-running this cell would double them again)
data['Value_new'] = data['Value_new'].apply(lambda x:x*2)
data.head()

,Sales Date,Category,Value,Product,Sales,Region,Value_new
0,2023-01-01,A,28.0,Product1,754.0,East,56
1,2023-01-02,B,39.0,Product3,110.0,North,78
2,2023-01-03,C,32.0,Product2,398.0,East,64
3,2023-01-04,B,8.0,Product1,522.0,East,16
4,2023-01-05,B,26.0,Product3,869.0,North,52
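For simple arithmetic like this, `apply` with a lambda is not required; operating on the whole column is usually shorter and faster. A minimal sketch comparing the two, using the first five Value entries shown above:

```python
import pandas as pd

value_new = pd.Series([28, 39, 32, 8, 26])  # first five Value entries from above

doubled_apply = value_new.apply(lambda x: x * 2)  # calls the lambda once per element
doubled_vector = value_new * 2                    # one vectorized operation

print(doubled_vector.tolist())
print(doubled_apply.tolist() == doubled_vector.tolist())
```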


Data Grouping and Aggregation

In [48]:
# Write a program in python to find the mean of the Value column for each product
grouped_mean = data.groupby('Product')['Value'].mean()
print(f'The grouping of products becomes:\n {grouped_mean}')

The grouping of products becomes:
 Product
Product1    46.214286
Product2    52.800000
Product3    55.166667
Name: Value, dtype: float64


In [49]:
# Write a program in python to find the total of the Value column for each product in each region
grouped_sum = data.groupby(['Product','Region'])['Value'].sum()
print(f'The region wise sales of the particular product becomes:\n {grouped_sum}')

The region wise sales of the particular product becomes:
 Product   Region
Product1  East      292.0
          North       9.0
          South     100.0
          West      246.0
Product2  East       56.0
          North     127.0
          South     181.0
          West      428.0
Product3  East      202.0
          North     203.0
          South     215.0
          West      373.0
Name: Value, dtype: float64


In [51]:
# Write a program in python to find the mean of the Value column for each product in each region
grouped_mean = data.groupby(['Product','Region'])['Value'].mean()
print(f'The region wise sales of the particular product becomes:\n {grouped_mean}')

The region wise sales of the particular product becomes:
 Product   Region
Product1  East      41.714286
          North      4.500000
          South     50.000000
          West      82.000000
Product2  East      28.000000
          North     63.500000
          South     60.333333
          West      53.500000
Product3  East      50.500000
          North     40.600000
          South     71.666667
          West      62.166667
Name: Value, dtype: float64


In [53]:
# Write a program in python to display the mean, total and count of the Value column in each region
data_sales = data.groupby(['Region'])['Value'].agg(['mean','sum','count'])
print(f'The particular sales becomes:')
data_sales

The particular sales becomes:


Region,mean,sum,count
East,42.307692,550.0,13
North,37.666667,339.0,9
South,62.0,496.0,8
West,61.588235,1047.0,17
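The result columns of `agg` can be given descriptive names with named aggregation. The sketch below uses hypothetical rows (not the `data.csv` used above), and the output column names `average_value`, `total_value` and `n_rows` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration; not the data.csv used above.
data = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'West'],
    'Value': [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Named aggregation: output_column=(input_column, function).
summary = data.groupby('Region').agg(
    average_value=('Value', 'mean'),
    total_value=('Value', 'sum'),
    n_rows=('Value', 'count'),
)
print(summary)
```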


Merging and Joining of DataFrames

In [61]:
# Write a program in python to create two different dataframes that will be merged in different ways
a = pd.DataFrame({'key':['A','B','C'], 'Values':[1,2,3]})
b = pd.DataFrame({'Key':['A','B','D'], 'Values':[4,5,6]})
print(f'The first dataframe becomes:')
a

The first dataframe becomes:


,key,Values
0,A,1
1,B,2
2,C,3


In [62]:
print(f'The second dataframe becomes:')
b

The second dataframe becomes:


,Key,Values
0,A,4
1,B,5
2,D,6


In [64]:
# Write a program in python to join the two dataframes on the key column
# (the column names are lower-cased first so that both frames share the name 'key')
a.columns = a.columns.str.lower()
b.columns = b.columns.str.lower()
df = pd.merge(a,b,on='key', how='inner')
print(f'The dataframe becomes:\n ')
df

The dataframe becomes:
 


,key,values_x,values_y
0,A,1,4
1,B,2,5
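An inner join keeps only the keys present in both frames, which is why C and D are missing from the result. The other join types can be sketched on the same two frames as follows; the suffixes `_a` and `_b` are chosen here for illustration.

```python
import pandas as pd

a = pd.DataFrame({'key': ['A', 'B', 'C'], 'values': [1, 2, 3]})
b = pd.DataFrame({'key': ['A', 'B', 'D'], 'values': [4, 5, 6]})

# how='left' keeps every row of a; unmatched rows get NaN from b.
left = pd.merge(a, b, on='key', how='left', suffixes=('_a', '_b'))

# how='outer' keeps keys from both sides; indicator=True adds a
# '_merge' column saying where each row came from.
outer = pd.merge(a, b, on='key', how='outer', indicator=True)

print(left)
print(outer)
```

The `_merge` column is handy for checking which keys failed to match before deciding on a join type.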
