# 🐼 What is Pandas?

Pandas is an open-source Python library used for data manipulation and analysis. It provides rich data structures and functions that allow efficient operations on datasets, especially tabular data like spreadsheets or SQL tables.

It is built on top of the NumPy library, making complex data manipulation tasks easier and more intuitive.

# 🔹 Why Use Pandas?

- Well-suited for tabular data
- Fast and efficient for data manipulation
- Works well with other data science libraries
- Simplifies tasks like cleaning, transforming, visualizing, and analyzing data

# 🔗 Integration with Other Libraries

Pandas works seamlessly with:

- NumPy: numerical operations
- Matplotlib: plotting graphs
- Seaborn: advanced data visualization
- SciPy: statistical analysis
- Scikit-learn: machine learning algorithms


# Tasks in Pandas

## 1. Data Cleaning, Merging, and Joining

Used to handle and unify data from multiple sources

Helps in removing duplicates and inconsistencies

Functions: merge(), concat(), drop_duplicates()
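
As a rough illustration of the three functions named above, here is a minimal sketch. The `customers` and `orders` tables are made-up example data, not part of this notebook's datasets.

```python
import pandas as pd

# Hypothetical example data, invented for illustration only.
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [250, 100, 75]})

# merge(): SQL-style join on a shared key column.
joined = pd.merge(customers, orders, on="id", how="inner")

# concat(): stack frames vertically; ignore_index renumbers the rows.
stacked = pd.concat([orders, orders], ignore_index=True)

# drop_duplicates(): keep the first occurrence of each repeated row.
unique_orders = stacked.drop_duplicates()

print(joined)
print(len(stacked), len(unique_orders))
```

With `how="inner"`, only ids present in both tables survive the join; `how="left"` would keep every customer instead.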

## 2. Handling Missing Data

Manage missing or null values (NaN) in datasets

Options include filling with a value or dropping the data

Functions: fillna(), dropna(), isnull()

## 3. Column Insertion and Deletion

Add new columns based on operations or conditions

Remove unnecessary columns from a DataFrame

Functions: df['new_col'] = ..., drop(), insert()

## 4. GroupBy Operations

Group data based on a key (e.g., department, category)

Apply aggregation functions like mean(), sum(), count()

Technique: "split-apply-combine"

Function: groupby()
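
The split-apply-combine idea can be sketched as follows. The `staff` table and its values are assumptions chosen only to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical staff table, invented for illustration.
staff = pd.DataFrame({
    "department": ["HR", "IT", "IT", "HR", "IT"],
    "salary": [40, 60, 80, 50, 70],
})

# Split rows by department, apply the aggregations, combine into one table.
summary = staff.groupby("department")["salary"].agg(["mean", "sum", "count"])
print(summary)
```

Each row of `summary` is one group, and each column is one aggregation.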

## 5. Data Visualization

Plot data directly from a DataFrame: pandas uses Matplotlib as its plotting backend, and Seaborn accepts DataFrames as input

Common plots: line, bar, histogram, boxplot

Functions: plot(), hist(), plot.bar()

# 📊 What is Data Analysis?

Data Analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.

# 🔍 Main Goals of Data Analysis:

Understand patterns and trends in data

Make informed decisions based on evidence

Predict outcomes using statistical and machine learning models

Communicate results clearly through visualizations

# Steps in Data Analysis:

| 🧩 **Step**              | 📝 **Description**                                                                 |
|-------------------------|------------------------------------------------------------------------------------|
| 1. **Data Collection**   | Gather data from various sources (CSV, Excel, database, APIs, etc.)               |
| 2. **Data Cleaning**     | Handle missing values, remove duplicates, correct errors                          |
| 3. **Data Exploration**  | Use statistical summaries and visualizations to understand the dataset            |
| 4. **Data Transformation** | Normalize, encode, or restructure data for analysis                           |
| 5. **Data Modeling**     | Apply algorithms or statistical models to find patterns                           |
| 6. **Interpretation**    | Draw conclusions and create reports, dashboards, or predictions                   |


# Tools Used in Data Analysis:

Python (with Pandas, NumPy, Matplotlib, Seaborn)

Excel

SQL

Power BI / Tableau

# Example Use Cases:

Analyzing customer buying behavior

Predicting stock market trends

Detecting fraud in financial transactions



In [271]:
# install pandas - pip install pandas
import pandas as pd

# Pandas Data Structures

# 1. Series

A one-dimensional labeled array capable of holding any data type (int, float, string, etc.).

Similar to a column in Excel or a single list with labels (index).

In [272]:
import pandas as pd
# a Series can be built from a list, tuple, or dictionary
l = [1,2,3,4,5,6]
# index= replaces the default 0-based integer index; dtype= sets the element type
a =  pd.Series(l , index=['a','b','c','d','e','f'], dtype='float', name='python')
print(a)
print()
print(type(a))
print()
# positional access; a[3] on a label index is deprecated, so use iloc
print(a.iloc[3])

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    6.0
Name: python, dtype: float64

<class 'pandas.core.series.Series'>

4.0




In [273]:
dic = {
    "name":["python","cpp","java","javascript"],
    "popularity":[10,9,8,7],
    "rank":[1,2,3,4]
}
a = pd.Series(dic)
print(a)


name          [python, cpp, java, javascript]
popularity                      [10, 9, 8, 7]
rank                             [1, 2, 3, 4]
dtype: object


In [274]:
s = pd.Series(10,index=[1,2,3,4,5])
print(s)
print()
print(type(s))

1    10
2    10
3    10
4    10
5    10
dtype: int64

<class 'pandas.core.series.Series'>


In [275]:
s1 = pd.Series(12 , index=[1,2,3,4,5,6,7])
s2 = pd.Series(12 , index=[1,2,3,4])
print(s1+s2)

1    24.0
2    24.0
3    24.0
4    24.0
5     NaN
6     NaN
7     NaN
dtype: float64
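
The NaN values above appear because `+` aligns the two Series on their index labels, and labels 5 to 7 exist only in `s1`. A minimal sketch of one way to avoid them, assuming a missing label should count as 0, is `Series.add` with `fill_value`:

```python
import pandas as pd

s1 = pd.Series(12, index=[1, 2, 3, 4, 5, 6, 7])
s2 = pd.Series(12, index=[1, 2, 3, 4])

# The + operator aligns on index labels; labels missing from one side give NaN.
plain = s1 + s2

# add() with fill_value treats a label missing from one side as 0 instead.
filled = s1.add(s2, fill_value=0)

print(filled)
```

The result is still float64, because the alignment step introduces NaN internally before the fill value is applied.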


# 2. DataFrame
A two-dimensional labeled data structure, like an Excel sheet or SQL table.

Consists of rows and columns, where each column is a Series.

In [276]:
l = [1,2,4,5,6]
df = pd.DataFrame(l)
print(df)
print()
print(type(df))

   0
0  1
1  2
2  4
3  5
4  6

<class 'pandas.core.frame.DataFrame'>


In [277]:
# all value lists must have the same length; otherwise pandas raises a ValueError
d ={"a" : [1,2,3,4,5],"b" : [6,7,8,9,10],"d":[1,2,3,4,5]}
df = pd.DataFrame(d)
print(df)
print()
print(type(df))

   a   b  d
0  1   6  1
1  2   7  2
2  3   8  3
3  4   9  4
4  5  10  5

<class 'pandas.core.frame.DataFrame'>


In [278]:
# keep only the required columns; index= sets the row labels
df = pd.DataFrame(d , columns=["a","d"], index=["a","b","c","d","e"])
print(df)
print()
# read a single value: dataframe_name["column_name"]["row_label"]
# (df.loc["row_label", "column_name"] does the same lookup in one step)
print(df["a"])
print()
print(df["a"]["b"])

   a  d
a  1  1
b  2  2
c  3  3
d  4  4
e  5  5

a    1
b    2
c    3
d    4
e    5
Name: a, dtype: int64

2


In [279]:
list_1 = [[1,2,3,4,5],[11,12,13,14,15]]
df = pd.DataFrame(list_1)
print(df)
print()
print(type(df))

    0   1   2   3   4
0   1   2   3   4   5
1  11  12  13  14  15

<class 'pandas.core.frame.DataFrame'>


In [280]:
sr = {"s":pd.Series([1,2,3,4,5]),"r":pd.Series([11,12,13,14,15])}
df = pd.DataFrame(sr)
print(df)
print()

   s   r
0  1  11
1  2  12
2  3  13
3  4  14
4  5  15



# Arithmetic Operations in Pandas

In [281]:
import pandas as pd
df = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
print(df)


   a   b   c
0  1   6  11
1  2   7  12
2  3   8  13
3  4   9   1
4  5  10   2


In [282]:
print(df.head(3))


   a  b   c
0  1  6  11
1  2  7  12
2  3  8  13


In [283]:
print(df.tail(2))

   a   b  c
3  4   9  1
4  5  10  2


In [284]:
df["d"] = df["a"] + df["b"]
print(df)

   a   b   c   d
0  1   6  11   7
1  2   7  12   9
2  3   8  13  11
3  4   9   1  13
4  5  10   2  15


In [285]:
df["d"] = df["a"] - df["b"]
print(df)

   a   b   c  d
0  1   6  11 -5
1  2   7  12 -5
2  3   8  13 -5
3  4   9   1 -5
4  5  10   2 -5


In [286]:
df["d"] = df["a"] * df["b"]
df

Unnamed: 0,a,b,c,d
0,1,6,11,6
1,2,7,12,14
2,3,8,13,24
3,4,9,1,36
4,5,10,2,50


In [287]:
df["d"] = df["a"] / df["b"]
df

Unnamed: 0,a,b,c,d
0,1,6,11,0.166667
1,2,7,12,0.285714
2,3,8,13,0.375
3,4,9,1,0.444444
4,5,10,2,0.5


In [288]:
df1 = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
df1

Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


In [289]:
# apply a condition: the comparison returns a boolean column
df1["python"] = df1["a"] <= 3
df1

Unnamed: 0,a,b,c,python
0,1,6,11,True
1,2,7,12,True
2,3,8,13,True
3,4,9,1,False
4,5,10,2,False


In [290]:
df1['python_1' ] = df1["b"] >= 8
df1

Unnamed: 0,a,b,c,python,python_1
0,1,6,11,True,False
1,2,7,12,True,False
2,3,8,13,True,True
3,4,9,1,False,True
4,5,10,2,False,True


In [291]:
df2 = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
df2

Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


In [292]:
# column totals (default axis=0) and row totals (axis=1)
print(df2.sum())
print()
print(df2.sum(axis=1))

a    15
b    40
c    39
dtype: int64

0    18
1    21
2    24
3    14
4    17
dtype: int64


In [293]:
print(df2.mean())
print()
print(df2.median())

a    3.0
b    8.0
c    7.8
dtype: float64

a     3.0
b     8.0
c    11.0
dtype: float64


In [294]:
print(df2.min())
print()
print(df2.max())


a    1
b    6
c    1
dtype: int64

a     5
b    10
c    13
dtype: int64


In [295]:
print(df2.std())
print()
print(df2.count())
print()
print(df2.describe())

a    1.581139
b    1.581139
c    5.805170
dtype: float64

a    5
b    5
c    5
dtype: int64

              a          b         c
count  5.000000   5.000000   5.00000
mean   3.000000   8.000000   7.80000
std    1.581139   1.581139   5.80517
min    1.000000   6.000000   1.00000
25%    2.000000   7.000000   2.00000
50%    3.000000   8.000000  11.00000
75%    4.000000   9.000000  12.00000
max    5.000000  10.000000  13.00000
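
As a complement to calling each statistic separately, the sketch below uses `agg()` to compute several at once on the same `df2` data. It also shows that `std()` defaults to the sample standard deviation (`ddof=1`), which is why the value for column `a` is 1.581139 rather than 1.414214.

```python
import pandas as pd

df2 = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [6, 7, 8, 9, 10],
    "c": [11, 12, 13, 1, 2]})

# agg() computes several statistics in one call; each row is one statistic.
stats = df2.agg(["sum", "mean", "min", "max"])
print(stats)

# std() defaults to the sample standard deviation (ddof=1);
# pass ddof=0 for the population standard deviation.
sample_std = df2["a"].std()
population_std = df2["a"].std(ddof=0)
print(sample_std, population_std)
```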


# Delete And Insert Data in Pandas



In [296]:
df = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
df

Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


In [297]:
# syntax:
# dataframe_name.insert(position, column_name, data)
# data must be a scalar, a Series, or a list/array with one value per row; a list of the wrong length raises a ValueError
df.insert(1,"python",df["a"])
df

Unnamed: 0,a,python,b,c
0,1,1,6,11
1,2,2,7,12
2,3,3,8,13
3,4,4,9,1
4,5,5,10,2


In [298]:
# only the first three values are assigned; the remaining rows are filled with NaN
df["python_12"] = df["b"][:3]
df

Unnamed: 0,a,python,b,c,python_12
0,1,1,6,11,6.0
1,2,2,7,12,7.0
2,3,3,8,13,8.0
3,4,4,9,1,
4,5,5,10,2,


In [299]:
# delete a column with pop(); it removes the column and returns it as a Series
# syntax: dataframe_name.pop("column_name")
var1 = df.pop("python_12")
var1

Unnamed: 0,python_12
0,6.0
1,7.0
2,8.0
3,
4,


In [300]:
del df["python"]
df

Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


# Create a CSV File
CSV stands for comma-separated values.

In [301]:
dataframe = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
dataframe

Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


In [302]:
#create csv file
#syntax : dataframe_name.to_csv("file_name.csv")
dataframe.to_csv("file.csv")
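
By default `to_csv()` also writes the row index as an unnamed first column, which then shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False`. It writes to strings and uses `io.StringIO` instead of a real file purely to stay self-contained; that substitution is an assumption of this example, not something the notebook does.

```python
import io
import pandas as pd

dataframe = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [6, 7, 8, 9, 10],
    "c": [11, 12, 13, 1, 2]})

# Default: the row index is written as an extra, unnamed first column.
with_index = dataframe.to_csv()

# index=False writes only the data columns.
without_index = dataframe.to_csv(index=False)

# Reading the second form back reproduces the original frame.
# (StringIO stands in for a file on disk.)
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.equals(dataframe))
```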

In [303]:
#read csv file
dataframe1  = pd.read_csv("stock_data.csv")
dataframe1

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42
3,REA,Company_A37E,319648,764607,-4.2
4,YDB,Company_L633,722455,83500,5.88
5,URH,Company_ELH9,744998,245825,5.67
6,LHU,Company_36OV,704146,321342,1.09
7,XIK,Company_32QS,964679,190200,-7.54
8,UMA,Company_YNY5,465866,134376,-1.26
9,ZAM,Company_JTBN,43883,333005,-1.32


In [304]:
#access required number of rows
dataframe2  = pd.read_csv("stock_data.csv", nrows=3)
dataframe2

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42


In [305]:
print(type(dataframe1))

<class 'pandas.core.frame.DataFrame'>


In [306]:
# read only selected columns, chosen by name
dataframe2  = pd.read_csv("stock_data.csv", usecols=['symbol','change'])
dataframe2.head()

Unnamed: 0,symbol,change
0,GEP,-5.02
1,YYO,6.55
2,WRA,-0.42
3,REA,-4.2
4,YDB,5.88


In [307]:
# columns can also be selected by position; positions start from 0
dataframe3  = pd.read_csv("stock_data.csv", usecols=[1 ,3])
dataframe3.head(4)

Unnamed: 0,security,avg.volume
0,Company_WWXX,495869
1,Company_YVIP,646886
2,Company_DUZ0,486664
3,Company_A37E,764607


In [308]:
# skip specific lines of the file
# skiprows=[line_numbers]: 0-based file lines, where line 0 is the header row
dataframe4  = pd.read_csv("stock_data.csv", skiprows=[1,5])
dataframe4.head(10)

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,YYO,Company_YVIP,301028,646886,6.55
1,WRA,Company_DUZ0,162025,486664,-0.42
2,REA,Company_A37E,319648,764607,-4.2
3,URH,Company_ELH9,744998,245825,5.67
4,LHU,Company_36OV,704146,321342,1.09
5,XIK,Company_32QS,964679,190200,-7.54
6,UMA,Company_YNY5,465866,134376,-1.26
7,ZAM,Company_JTBN,43883,333005,-1.32
8,FOZ,Company_L85Q,485442,122378,0.01
9,ZKW,Company_OUXF,982011,188490,-0.97


In [309]:
# index_col="column_name": use that column as the row index
dataframe5  = pd.read_csv("stock_data.csv", index_col="symbol")
dataframe5.head()


symbol,security,today.volume,avg.volume,change
GEP,Company_WWXX,479025,495869,-5.02
YYO,Company_YVIP,301028,646886,6.55
WRA,Company_DUZ0,162025,486664,-0.42
REA,Company_A37E,319648,764607,-4.2
YDB,Company_L633,722455,83500,5.88


In [310]:
# header=line_number: use that line of the file (0-based) as the column names;
# the lines before it are discarded
dataframe6  = pd.read_csv("stock_data.csv", header=2)
dataframe6.head()

Unnamed: 0,YYO,Company_YVIP,301028,646886,6.55
0,WRA,Company_DUZ0,162025,486664,-0.42
1,REA,Company_A37E,319648,764607,-4.2
2,YDB,Company_L633,722455,83500,5.88
3,URH,Company_ELH9,744998,245825,5.67
4,LHU,Company_36OV,704146,321342,1.09


In [311]:
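# names=[...] supplies the column names; without header=0 the file's own
# header line is kept as the first data row, as the output shows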
dataframe7  = pd.read_csv("stock_data.csv", names=['col_1','col_2','col_3','col_4','col_5'])
dataframe7.head()

Unnamed: 0,col_1,col_2,col_3,col_4,col_5
0,symbol,security,today.volume,avg.volume,change
1,GEP,Company_WWXX,479025,495869,-5.02
2,YYO,Company_YVIP,301028,646886,6.55
3,WRA,Company_DUZ0,162025,486664,-0.42
4,REA,Company_A37E,319648,764607,-4.2
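
In the output above the original header line ends up as data row 0. A minimal sketch of how `names` and `header=0` interact is shown below. It uses a two-row in-memory stand-in (values copied from the rows shown earlier) rather than the real `stock_data.csv`.

```python
import io
import pandas as pd

# A two-row stand-in for the stock file; values copied from the rows shown above.
csv_text = (
    "symbol,security,today.volume,avg.volume,change\n"
    "GEP,Company_WWXX,479025,495869,-5.02\n"
    "YYO,Company_YVIP,301028,646886,6.55\n"
)
new_names = ["col_1", "col_2", "col_3", "col_4", "col_5"]

# names alone: the file's own header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)

# names with header=0: the file's header line is discarded and replaced.
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(len(kept), len(replaced))
```

A side effect of keeping the header as data is that every column is read as text, so numeric operations on `kept` would not work as expected.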


In [312]:
# header=None: treat the file as having no header row; columns are numbered from 0
dataframe8  = pd.read_csv("stock_data.csv", header=None)
dataframe8.head()

Unnamed: 0,0,1,2,3,4
0,symbol,security,today.volume,avg.volume,change
1,GEP,Company_WWXX,479025,495869,-5.02
2,YYO,Company_YVIP,301028,646886,6.55
3,WRA,Company_DUZ0,162025,486664,-0.42
4,REA,Company_A37E,319648,764607,-4.2


In [313]:
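# prefix="col" used to name the columns col0, col1, ..., coln when header=None.
# The parameter was deprecated in pandas 1.4 and removed in 2.0, so the call stays commented out.
# dataframe9  = pd.read_csv("stock_data.csv", header=None, prefix="col")
# dataframe9.head()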
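
Recent pandas versions no longer accept `prefix` in `read_csv`, which is presumably why the cell above is commented out. One possible replacement, sketched here on a small headerless stand-in string rather than the real file, is to read with `header=None` and then call `add_prefix`:

```python
import io
import pandas as pd

# Headerless stand-in data, invented for this sketch.
csv_text = "GEP,479025,-5.02\nYYO,301028,6.55\n"

# header=None numbers the columns 0, 1, 2; add_prefix then renames them.
frame = pd.read_csv(io.StringIO(csv_text), header=None).add_prefix("col")
print(frame.columns.tolist())
```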

In [314]:
# dtype= sets the data type of selected columns while reading
dataframe10  = pd.read_csv("stock_data.csv", dtype={'today.volume': 'float64'})
dataframe10.head()


Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025.0,495869,-5.02
1,YYO,Company_YVIP,301028.0,646886,6.55
2,WRA,Company_DUZ0,162025.0,486664,-0.42
3,REA,Company_A37E,319648.0,764607,-4.2
4,YDB,Company_L633,722455.0,83500,5.88


# Pandas Functions

In [315]:
import pandas as pd

dataframe = pd.read_csv("stock_data.csv")
dataframe.head()

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42
3,REA,Company_A37E,319648,764607,-4.2
4,YDB,Company_L633,722455,83500,5.88


In [316]:
dataframe.index

RangeIndex(start=0, stop=26, step=1)

In [317]:
# get column name
dataframe.columns

Index(['symbol', 'security', 'today.volume', 'avg.volume', 'change'], dtype='object')

In [318]:
dataframe.describe()

Unnamed: 0,today.volume,avg.volume,change
count,26.0,26.0,26.0
mean,496524.5,396419.384615,0.845769
std,300547.256593,242832.029994,5.467192
min,25617.0,37056.0,-8.6
25%,260127.5,188917.5,-3.31
50%,472445.5,373864.5,-0.205
75%,737386.0,561754.25,6.1075
max,999105.0,881645.0,8.31


In [319]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   symbol        26 non-null     object 
 1   security      26 non-null     object 
 2   today.volume  26 non-null     int64  
 3   avg.volume    26 non-null     int64  
 4   change        26 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.1+ KB


In [320]:
dataframe.head(3)

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42


In [321]:
dataframe.tail(3)

Unnamed: 0,symbol,security,today.volume,avg.volume,change
23,CXK,Company_4DHU,742363,247012,6.17
24,MNZ,Company_P56S,747011,881645,6.52
25,FUW,Company_KUU3,161884,567346,-3.28


In [322]:
dataframe[:2]

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55


In [323]:
dataframe[ :10:2]

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
2,WRA,Company_DUZ0,162025,486664,-0.42
4,YDB,Company_L633,722455,83500,5.88
6,LHU,Company_36OV,704146,321342,1.09
8,UMA,Company_YNY5,465866,134376,-1.26


In [324]:
print(type(dataframe))

<class 'pandas.core.frame.DataFrame'>


In [325]:
import pandas as pd

dataframe = pd.read_csv("stock_data.csv")
dataframe.head()

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42
3,REA,Company_A37E,319648,764607,-4.2
4,YDB,Company_L633,722455,83500,5.88


In [326]:
#convert into array
dataframe.index.array

<NumpyExtensionArray>
[ np.int64(0),  np.int64(1),  np.int64(2),  np.int64(3),  np.int64(4),
  np.int64(5),  np.int64(6),  np.int64(7),  np.int64(8),  np.int64(9),
 np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14),
 np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19),
 np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24),
 np.int64(25)]
Length: 26, dtype: int64

In [327]:
#convert to numpy array
dataframe.to_numpy()

array([['GEP', 'Company_WWXX', 479025, 495869, -5.02],
       ['YYO', 'Company_YVIP', 301028, 646886, 6.55],
       ['WRA', 'Company_DUZ0', 162025, 486664, -0.42],
       ['REA', 'Company_A37E', 319648, 764607, -4.2],
       ['YDB', 'Company_L633', 722455, 83500, 5.88],
       ['URH', 'Company_ELH9', 744998, 245825, 5.67],
       ['LHU', 'Company_36OV', 704146, 321342, 1.09],
       ['XIK', 'Company_32QS', 964679, 190200, -7.54],
       ['UMA', 'Company_YNY5', 465866, 134376, -1.26],
       ['ZAM', 'Company_JTBN', 43883, 333005, -1.32],
       ['FOZ', 'Company_L85Q', 485442, 122378, 0.01],
       ['ZKW', 'Company_OUXF', 982011, 188490, -0.97],
       ['YQV', 'Company_9M9P', 586655, 143374, 5.92],
       ['OTD', 'Company_JOPR', 149463, 414724, 4.24],
       ['KJF', 'Company_F2RR', 246494, 712221, -2.59],
       ['RRX', 'Company_W5E3', 25617, 448414, -3.32],
       ['ANP', 'Company_11G3', 975490, 281419, 7.69],
       ['SSH', 'Company_SXOW', 416377, 37056, 8.31],
       ['HCQ', 'Company_

In [328]:
#convert to numpy array
import numpy as np
v = np.asarray(dataframe)
print(v)

[['GEP' 'Company_WWXX' 479025 495869 -5.02]
 ['YYO' 'Company_YVIP' 301028 646886 6.55]
 ['WRA' 'Company_DUZ0' 162025 486664 -0.42]
 ['REA' 'Company_A37E' 319648 764607 -4.2]
 ['YDB' 'Company_L633' 722455 83500 5.88]
 ['URH' 'Company_ELH9' 744998 245825 5.67]
 ['LHU' 'Company_36OV' 704146 321342 1.09]
 ['XIK' 'Company_32QS' 964679 190200 -7.54]
 ['UMA' 'Company_YNY5' 465866 134376 -1.26]
 ['ZAM' 'Company_JTBN' 43883 333005 -1.32]
 ['FOZ' 'Company_L85Q' 485442 122378 0.01]
 ['ZKW' 'Company_OUXF' 982011 188490 -0.97]
 ['YQV' 'Company_9M9P' 586655 143374 5.92]
 ['OTD' 'Company_JOPR' 149463 414724 4.24]
 ['KJF' 'Company_F2RR' 246494 712221 -2.59]
 ['RRX' 'Company_W5E3' 25617 448414 -3.32]
 ['ANP' 'Company_11G3' 975490 281419 7.69]
 ['SSH' 'Company_SXOW' 416377 37056 8.31]
 ['HCQ' 'Company_1VTO' 355190 802581 7.56]
 ['KZO' 'Company_CSVK' 179963 129910 -5.92]
 ['ECE' 'Company_MV67' 507731 512063 -8.6]
 ['TPC' 'Company_QKR9' 441088 544979 -6.45]
 ['NYH' 'Company_K20Q' 999105 571018 7.27]
 ['CX

In [329]:
# axis=0 sorts by the row index labels
# axis=1 sorts by the column labels
dataframe.sort_index(axis=0, ascending=False)

Unnamed: 0,symbol,security,today.volume,avg.volume,change
25,FUW,Company_KUU3,161884,567346,-3.28
24,MNZ,Company_P56S,747011,881645,6.52
23,CXK,Company_4DHU,742363,247012,6.17
22,NYH,Company_K20Q,999105,571018,7.27
21,TPC,Company_QKR9,441088,544979,-6.45
20,ECE,Company_MV67,507731,512063,-8.6
19,KZO,Company_CSVK,179963,129910,-5.92
18,HCQ,Company_1VTO,355190,802581,7.56
17,SSH,Company_SXOW,416377,37056,8.31
16,ANP,Company_11G3,975490,281419,7.69


In [330]:
# change a value with chained indexing (discouraged: it triggers the warnings below)
# syntax: dataframe_name["column_name"][row_label] = new_value
dataframe["symbol"][0] = "python"
dataframe.head()

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,python,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42
3,REA,Company_A37E,319648,764607,-4.2
4,YDB,Company_L633,722455,83500,5.88


In [331]:
# preferred method: a single .loc assignment, which raises no warning
# syntax: dataframe_name.loc[row_label, "column_name"] = new_value
dataframe.loc[1,"symbol"] = "python"
dataframe.head()

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,python,Company_WWXX,479025,495869,-5.02
1,python,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42
3,REA,Company_A37E,319648,764607,-4.2
4,YDB,Company_L633,722455,83500,5.88


In [332]:
#get particular data using loc
dataframe.loc[[1,2,5],["symbol","today.volume"]]

Unnamed: 0,symbol,today.volume
1,python,301028
2,WRA,162025
5,URH,744998


In [333]:
dataframe.loc[[1,2,5],:]

Unnamed: 0,symbol,security,today.volume,avg.volume,change
1,python,Company_YVIP,301028,646886,6.55
2,WRA,Company_DUZ0,162025,486664,-0.42
5,URH,Company_ELH9,744998,245825,5.67


In [334]:
dataframe.loc[:,["symbol","today.volume"]]

Unnamed: 0,symbol,today.volume
0,python,479025
1,python,301028
2,WRA,162025
3,REA,319648
4,YDB,722455
5,URH,744998
6,LHU,704146
7,XIK,964679
8,UMA,465866
9,ZAM,43883


In [335]:
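# access a value by integer position: iloc[row_position, column_position]
# only the last expression in a cell is displayed, so just the second lookup is shown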
dataframe.iloc[0,1]
dataframe.iloc[1,3]

np.int64(646886)
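
With the default 0-based index, `loc` and `iloc` can look interchangeable. The sketch below uses a small stand-in frame with a deliberately non-default index (an assumption made only for illustration) to show that `loc` works with labels and `iloc` with positions.

```python
import pandas as pd

# Small stand-in frame with a non-default index to make the difference visible.
frame = pd.DataFrame(
    {"symbol": ["GEP", "YYO", "WRA"], "change": [-5.02, 6.55, -0.42]},
    index=[10, 20, 30],
)

by_label = frame.loc[20, "symbol"]   # row labelled 20, column "symbol"
by_position = frame.iloc[1, 0]       # second row, first column
label_slice = frame.loc[10:20]       # label slices include the end label
position_slice = frame.iloc[0:1]     # position slices exclude the end

print(by_label, by_position, len(label_slice), len(position_slice))
```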

# drop() function

In [336]:
import pandas as pd
dataframe = pd.read_csv("stock_data.csv")
data = dataframe.drop('security', axis=1)
data

Unnamed: 0,symbol,today.volume,avg.volume,change
0,GEP,479025,495869,-5.02
1,YYO,301028,646886,6.55
2,WRA,162025,486664,-0.42
3,REA,319648,764607,-4.2
4,YDB,722455,83500,5.88
5,URH,744998,245825,5.67
6,LHU,704146,321342,1.09
7,XIK,964679,190200,-7.54
8,UMA,465866,134376,-1.26
9,ZAM,43883,333005,-1.32


In [337]:
data1 = dataframe.drop(2, axis=0)
data1

Unnamed: 0,symbol,security,today.volume,avg.volume,change
0,GEP,Company_WWXX,479025,495869,-5.02
1,YYO,Company_YVIP,301028,646886,6.55
3,REA,Company_A37E,319648,764607,-4.2
4,YDB,Company_L633,722455,83500,5.88
5,URH,Company_ELH9,744998,245825,5.67
6,LHU,Company_36OV,704146,321342,1.09
7,XIK,Company_32QS,964679,190200,-7.54
8,UMA,Company_YNY5,465866,134376,-1.26
9,ZAM,Company_JTBN,43883,333005,-1.32
10,FOZ,Company_L85Q,485442,122378,0.01


# dropna() - Remove Missing Data

### 📋 `dropna()` Parameters

| **Parameter** | **Type**        | **Default** | **Description** |
|---------------|------------------|-------------|------------------|
| `axis`        | int or str       | `0`         | Axis along which to drop missing values:<br>`0` = rows, `1` = columns |
| `how`         | str              | `'any'`     | `'any'` drops rows/columns with **any** NaN<br>`'all'` drops only if **all** values are NaN |
| `thresh`      | int              | `None`      | Require a minimum number of **non-NaN** values to retain the row/column |
| `subset`      | list-like        | `None`      | Labels along the axis to consider for detecting missing values |
| `inplace`     | bool             | `False`     | Modify the DataFrame **in place** (without returning a new object) |
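
The parameters in the table can be compared side by side on a tiny frame. The `cars` data below is made up (only the column names mirror `car.csv`), so that each result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; the column names mirror car.csv but the values do not.
cars = pd.DataFrame({
    "Make": ["Honda", None, "BMW", None],
    "Doors": [4.0, 4.0, np.nan, np.nan],
    "Price": [15323.0, 19943.0, np.nan, np.nan],
})

any_nan = cars.dropna()                     # drop rows with at least one NaN
all_nan = cars.dropna(how="all")            # drop rows where every value is NaN
at_least_two = cars.dropna(thresh=2)        # keep rows with >= 2 non-NaN values
make_known = cars.dropna(subset=["Make"])   # only check the Make column

print(len(any_nan), len(all_nan), len(at_least_two), len(make_known))
```

None of these calls modifies `cars`; each returns a new DataFrame unless `inplace=True` is passed.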


In [338]:
import pandas as pd

dataframe = pd.read_csv("car.csv")
dataframe.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [339]:
dataframe.shape

(1000, 5)

In [340]:
dataframe.isnull()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
995,False,False,False,False,False
996,True,False,False,False,False
997,False,False,False,False,False
998,False,False,False,False,False


In [341]:
dataframe.isnull().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64


In [342]:
dataframe1 = dataframe.dropna()
dataframe1.shape

(773, 5)

In [343]:
# axis=1 drops every column that contains at least one NaN
dataframe2 = dataframe.dropna(axis=1)
dataframe2.shape

(1000, 0)

In [344]:
dataframe3 = dataframe.dropna(how='all')
dataframe3.shape

(1000, 5)

In [345]:
dataframe4 = dataframe.dropna(how="any")
dataframe4.shape

(773, 5)

In [346]:
dataframe5 = dataframe.dropna(thresh=1)
dataframe5.shape

(1000, 5)

In [347]:
# subset: only look for NaN in the listed columns
dataframe6 = dataframe.dropna(subset=['Make'])
dataframe6.shape

(951, 5)

In [348]:
# inplace=True modifies dataframe itself instead of returning a new object
dataframe.dropna(inplace=True)
dataframe.shape

(773, 5)

# fillna() - filling the missing values


### 📋 `fillna()` Parameters

| **Parameter** | **Type**           | **Default** | **Description** |
|---------------|--------------------|-------------|------------------|
| `value`       | scalar, dict, Series, or DataFrame | `None`      | Value to use to fill NaNs (can be a single value or a mapping by column) |
| `method`      | str                | `None`      | Method to use for filling: <br>`'ffill'` = forward fill, `'bfill'` = backward fill. Deprecated since pandas 2.1; call `ffill()` or `bfill()` directly instead |
| `axis`        | {0 or 'index', 1 or 'columns'} | `None`      | Axis along which to fill missing values |
| `inplace`     | bool               | `False`     | If `True`, modifies the DataFrame in place |
| `limit`       | int                | `None`      | Maximum number of NaNs to fill in a forward/backward fill |
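
A minimal sketch of the main filling options on a small made-up frame (the `cars` values are assumptions for illustration). It uses `ffill()` and `bfill()` directly, since newer pandas versions warn about `fillna(method=...)`.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both columns.
cars = pd.DataFrame({
    "Make": ["Honda", None, None, "BMW"],
    "Price": [100.0, np.nan, 300.0, np.nan],
})

# A dict fills each column with its own value.
per_column = cars.fillna({"Make": "unknown", "Price": 0})

# ffill()/bfill() copy the previous/next valid value down or up each column.
forward = cars.ffill()
backward = cars.bfill()

# limit caps how many consecutive NaNs are filled after a valid value.
forward_once = cars.ffill(limit=1)

print(per_column)
print(forward_once)
```

A trailing NaN stays NaN under `bfill()`, and a leading NaN stays NaN under `ffill()`, because there is no neighbouring value to copy.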


In [349]:
data = pd.read_csv("car.csv")
data.head()


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [350]:
data.fillna("python")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,python,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [351]:
data.ffill()  # forward fill; replaces the deprecated data.fillna(method="ffill")


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,Toyota,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [352]:
data.bfill()  # backward fill; replaces the deprecated data.fillna(method="bfill")


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,Nissan,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [353]:
data.fillna({"Make":"java"}) # a dict fills each listed column with its own value; several columns can be given

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,java,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [354]:
# backward fill along each row (axis=1): a gap takes the value from the next column
data.bfill(axis=1)  # replaces the deprecated data.fillna(method="bfill", axis=1)


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,White,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [358]:
# forward fill in place: modifies data itself
data.ffill(inplace=True)  # replaces the deprecated data.fillna(method="ffill", inplace=True)
data


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,Toyota,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [361]:
# fill every NaN with a fixed value, in place
df5 = pd.read_csv("car.csv")
df5.fillna(value=0,inplace=True)
df5


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,0,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [363]:
# limit=1: fill at most one consecutive NaN after each valid value
df6 = pd.read_csv("car.csv")
df6.ffill(limit=1)  # replaces the deprecated df6.fillna(method="ffill", limit=1)


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,Toyota,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


# Handling Missing Data

## replace() - Replace Specific Values

Replaces values in a DataFrame or Series with another value or set of values.

### 🔁 `replace()` Parameters

| **Parameter**    | **Type**                    | **Default** | **Description** |
|------------------|-----------------------------|-------------|------------------|
| `to_replace`     | scalar, list, dict, regex   | `None`      | The value(s) to be replaced |
| `value`          | scalar, list, dict          | `None`      | The replacement value(s) |
| `inplace`        | bool                        | `False`     | If `True`, modify the object in place |
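| `limit`          | int                         | `None`      | Maximum size of gap to forward or backward fill (deprecated since pandas 2.1) |
| `regex`          | bool or regex pattern       | `False`     | Whether to interpret `to_replace` as a regex |
| `method`         | str                         | `'pad'`     | Fill method used when `to_replace` is a scalar or list and `value` is `None`: 'pad', 'ffill', 'bfill' (deprecated since pandas 2.1) |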
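
The forms of `to_replace` listed in the table can be sketched on a small made-up frame (the `cars` values are illustrative, not taken from `car.csv`):

```python
import pandas as pd

# Made-up data for illustration.
cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
})

# Dict form: map several old values to new ones in a single call.
mapped = cars.replace({"Honda": "H", "BMW": "B"})

# Nested dict: restrict the replacement to one column.
one_column = cars.replace({"Colour": {"White": "W"}})

# Regex: "+" matches the whole run of letters, so each word is replaced once.
masked = cars.replace({"Make": r"^[A-Za-z]+$"}, "python", regex=True)

print(mapped)
print(masked)
```

Without the `+`, a pattern such as `[A-Za-z]` matches every letter separately, so each letter would be replaced on its own.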


In [364]:
import pandas as pd
df = pd.read_csv("car.csv")
df

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [366]:
# replace function
df.replace(to_replace="Toyota",value="python")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,python,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,python,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [367]:
df.replace(['Honda',"BMW"],500)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,500,White,35431.0,4.0,15323.0
1,500,Blue,192714.0,5.0,19943.0
2,500,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,500,White,215883.0,4.0,4001.0


In [370]:
#using regex
df.replace("[A-Za-z]","p",regex=True)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,ppppp,ppppp,35431.0,4.0,15323.0
1,ppp,pppp,192714.0,5.0,19943.0
2,ppppp,ppppp,84714.0,4.0,28343.0
3,pppppp,ppppp,154365.0,4.0,13434.0
4,pppppp,pppp,181577.0,3.0,14043.0
...,...,...,...,...,...
995,pppppp,ppppp,35820.0,4.0,32042.0
996,,ppppp,155144.0,3.0,5716.0
997,pppppp,pppp,66604.0,4.0,31570.0
998,ppppp,ppppp,215883.0,4.0,4001.0


In [372]:
# using a dictionary to restrict the replacement to one column ("Make");
# "[A-Za-z]" matches a single letter, so every letter is replaced separately
df.replace({"Make":"[A-Za-z]"},"python",regex=True)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,pythonpythonpythonpythonpython,White,35431.0,4.0,15323.0
1,pythonpythonpython,Blue,192714.0,5.0,19943.0
2,pythonpythonpythonpythonpython,White,84714.0,4.0,28343.0
3,pythonpythonpythonpythonpythonpython,White,154365.0,4.0,13434.0
4,pythonpythonpythonpythonpythonpython,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,pythonpythonpythonpythonpythonpython,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,pythonpythonpythonpythonpythonpython,Blue,66604.0,4.0,31570.0
998,pythonpythonpythonpythonpython,White,215883.0,4.0,4001.0
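
The repeated `python` strings above come from the pattern matching one letter at a time. A minimal sketch of the difference between a per-character pattern and an anchored whole-value pattern follows; the `makes` frame is a made-up two-row sample, not the notebook's `car.csv`.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the car data.
makes = pd.DataFrame({"Make": ["Honda", "BMW"], "Colour": ["White", "Blue"]})

# "[A-Za-z]" matches one letter at a time, so each letter is replaced separately.
per_letter = makes.replace({"Make": "[A-Za-z]"}, "p", regex=True)

# "^[A-Za-z]+$" matches the whole value, so it is replaced exactly once.
whole_value = makes.replace({"Make": "^[A-Za-z]+$"}, "python", regex=True)

print(per_letter["Make"].tolist())
print(whole_value["Make"].tolist())
```

In both calls the `Colour` column is untouched, because the dictionary key names only `Make`.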


In [376]:
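# backward replace: with no value given, method="bfill" replaces each match with the next value below it
# ("java" does not occur in this data, so the output is unchanged; `method` is deprecated in recent pandas)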
df.replace("java",method="bfill")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [377]:
# forward replace: method="ffill" replaces each match with the previous value above it
df.replace("java",method="ffill")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [378]:
#using limit parameter
df.replace("java",method="ffill",limit=1)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [379]:
#using inplace parameter
df.replace("java",method="ffill",inplace=True)
df

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0
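
Because the car data contains no `"java"`, the `method=` calls above leave the frame unchanged, and recent pandas releases deprecate `method` in `replace()`. One possible way to get the same forward-fill effect without it is to combine `mask()`, `ffill()` and `where()`. This is a sketch on a made-up Series (`langs`), not the approach the notebook uses.

```python
import pandas as pd

# Hypothetical data that actually contains the value to be replaced.
langs = pd.Series(["python", "java", "go", "java", "java"])

is_java = langs == "java"

# Blank out the matches, then forward-fill from the previous surviving value.
forward = langs.mask(is_java).ffill()

# Keep the original entries everywhere except at the matched positions.
result = langs.where(~is_java, forward)

print(result.tolist())
```

The final `where()` step means values that were already missing in the original stay missing; only the matched positions are filled.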


# interpolate

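interpolate() – Estimate Missing Data
Fills missing values using interpolation (e.g., linear, polynomial).

#### NOTE: it works with numeric data only. Non-numeric columns such as `Make` are not interpolated, which is why the missing `Make` in row 996 stays empty in the outputs of this section.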



### 🔄 `interpolate()` Parameters

| **Parameter**        | **Type**      | **Default**   | **Description** |
|----------------------|---------------|---------------|------------------|
| `method`             | str           | `'linear'`    | Interpolation method: `'linear'`, `'time'`, `'index'`, `'polynomial'`, etc. |
| `axis`               | int or str    | `0`           | Axis to interpolate along (0 = down each column, 1 = across each row) |
| `limit`              | int           | `None`        | Maximum number of consecutive NaNs to fill |
| `inplace`            | bool          | `False`       | Modify the original DataFrame if `True` |
| `limit_direction`    | str           | `None`        | Direction: `'forward'`, `'backward'`, or `'both'`; behaves as `'forward'` for `'linear'` when not set |
| `limit_area`         | str           | `None`        | `'inside'` fills only NaNs surrounded by valid values; `'outside'` fills only NaNs beyond the first or last valid value |
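
The rows shown from `car.csv` have no gaps in their numeric columns, so the effect of `interpolate()` is easier to see on a small Series with known holes. The sketch below uses made-up odometer readings (`odometer` is a hypothetical name).

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-value gap and a trailing gap.
odometer = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

# Default linear interpolation: the interior gap is filled with evenly spaced
# values and the trailing NaN is carried forward from the last valid value.
linear = odometer.interpolate()

# limit=1 fills at most one consecutive NaN per gap.
capped = odometer.interpolate(limit=1)

print(linear.tolist())
print(capped.tolist())
```

With `limit=1` the second NaN of the interior gap is left missing, while the single trailing NaN is still filled.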


In [380]:
import pandas as pd
df = pd.read_csv("car.csv")
df

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [381]:
df.interpolate()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [382]:
#linear method
df.interpolate(method="linear")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [385]:
# axis parameter: axis=0 (used here) interpolates down each column;
# axis=1 interpolates across each row, which only works when all columns share a numeric dtype
df.interpolate(method="linear",axis=0)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [387]:
#limit parameter
df.interpolate(method="linear",limit=1)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0




In [389]:
#limit_direction
# df.interpolate(method="linear",limit_direction="backward")
# df.interpolate(method="linear",limit_direction="forward")
df.interpolate(method="linear",limit_direction="both")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [392]:
# limit_area
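# "inside"  -> fill only NaNs that have valid values on both sides (interpolation)
# "outside" -> fill only NaNs before the first or after the last valid value (extrapolation); interior gaps stay as they are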
df.interpolate(method="linear",limit_area="inside")
# df.interpolate(method="linear",limit_area="outside")

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0
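
Since the visible rows do not change under `limit_direction` or `limit_area`, a small Series with a leading, an interior and a trailing gap makes the difference visible. This is a minimal sketch on invented values (`s` is a hypothetical name).

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading, an interior and a trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Only the interior NaN (between 1.0 and 3.0) is filled.
inside = s.interpolate(limit_area="inside")

# Only NaNs outside the valid range are candidates; with the default forward
# direction that is the trailing one.
outside = s.interpolate(limit_area="outside")

# Filling in both directions also covers the leading NaN.
both_dirs = s.interpolate(limit_direction="both")

print(inside.tolist())
print(outside.tolist())
print(both_dirs.tolist())
```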


# merge function

merge() – Merge DataFrames
The merge() function in Pandas is used to combine two DataFrames based on a common column(s) or index, similar to SQL joins.

### 🔗 `merge()` Parameters

| **Parameter**     | **Type**            | **Default**   | **Description** |
|------------------|---------------------|---------------|------------------|
| `left`           | DataFrame           | required      | First DataFrame to merge |
| `right`          | DataFrame           | required      | Second DataFrame to merge |
| `how`            | str                 | `'inner'`     | Type of merge: `'left'`, `'right'`, `'outer'`, `'inner'`, `'cross'` |
| `on`             | label or list       | `None`        | Column(s) to join on (must be in both DataFrames); if omitted, the shared column names are used |
| `left_on`        | label or list       | `None`        | Column(s) from the left DataFrame to join on |
| `right_on`       | label or list       | `None`        | Column(s) from the right DataFrame to join on |
| `left_index`     | bool                | `False`       | Use the index of the left DataFrame as its join key |
| `right_index`    | bool                | `False`       | Use the index of the right DataFrame as its join key |
| `sort`           | bool                | `False`       | Sort merged result by join keys |
| `suffixes`       | tuple of (str, str) | `('_x', '_y')`| Suffixes to apply to overlapping column names |
| `copy`           | bool                | `True`        | Copy data into a new DataFrame |
| `indicator`      | bool or str         | `False`       | Add a column showing merge source (left_only, right_only, both) |
| `validate`       | str                 | `None`        | Check the key relationship: `'one_to_one'`, `'one_to_many'`, `'many_to_one'`, `'many_to_many'` |


### 🔸 Merge Types (`how`)

| **Method** | **Description** |
|------------|------------------|
| `'inner'`  | Only matching rows from both DataFrames |
| `'outer'`  | All rows from both DataFrames, with NaN where no match |
| `'left'`   | All rows from the left DataFrame, with matching from the right |
| `'right'`  | All rows from the right DataFrame, with matching from the left |
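
The four join types can be compared by row count, and `indicator=True` shows where each row of an outer join came from. The sketch below uses two small made-up frames (`left`, `right`) that share the key column `a`.

```python
import pandas as pd

# Hypothetical frames that share the key column "a".
left = pd.DataFrame({"a": [1, 2, 5], "c": [11, 12, 2]})
right = pd.DataFrame({"a": [1, 2, 6], "d": [21, 22, 25]})

inner = pd.merge(left, right, on="a", how="inner")    # keys 1, 2
left_join = pd.merge(left, right, on="a", how="left")    # keys 1, 2, 5
right_join = pd.merge(left, right, on="a", how="right")  # keys 1, 2, 6

# indicator=True adds a "_merge" column: left_only, right_only or both.
outer = pd.merge(left, right, on="a", how="outer", indicator=True)

print(len(inner), len(left_join), len(right_join), len(outer))
print(outer[["a", "_merge"]])
```

Keys 5 and 6 exist on only one side, so they appear in the outer join tagged `left_only` and `right_only` respectively.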


In [408]:
import pandas as pd
df1 = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
df1


Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


In [409]:
df2 = pd.DataFrame({
    "a":[1,2,3,4,6],
    "b":[6,7,8,9,10],
    "d":[21,22,23,24,25]})
df2

Unnamed: 0,a,b,d
0,1,6,21
1,2,7,22
2,3,8,23
3,4,9,24
4,6,10,25


In [410]:
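# merge two DataFrames: with no `on`, pandas joins on the shared columns (a and b)
# and defaults to an inner join, so only rows matching in both frames are kept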
df3 = pd.merge(df1,df2)
df3

Unnamed: 0,a,b,c,d
0,1,6,11,21
1,2,7,12,22
2,3,8,13,23
3,4,9,1,24


In [411]:
df4 = pd.merge(df1,df2,how="inner")
df4

Unnamed: 0,a,b,c,d
0,1,6,11,21
1,2,7,12,22
2,3,8,13,23
3,4,9,1,24


In [412]:
df5 = pd.merge(df1,df2,how="outer")
df5

Unnamed: 0,a,b,c,d
0,1,6,11.0,21.0
1,2,7,12.0,22.0
2,3,8,13.0,23.0
3,4,9,1.0,24.0
4,5,10,2.0,
5,6,10,,25.0


In [413]:
df6 = pd.merge(df1,df2,how="left")
df6

Unnamed: 0,a,b,c,d
0,1,6,11,21.0
1,2,7,12,22.0
2,3,8,13,23.0
3,4,9,1,24.0
4,5,10,2,


In [414]:
df7 = pd.merge(df1,df2,how="right")
df7

Unnamed: 0,a,b,c,d
0,1,6,11.0,21
1,2,7,12.0,22
2,3,8,13.0,23
3,4,9,1.0,24
4,6,10,,25


In [415]:
df8 = pd.merge(df1,df2,right_index=True,left_index=True)
df8

Unnamed: 0,a_x,b_x,c,a_y,b_y,d
0,1,6,11,1,6,21
1,2,7,12,2,7,22
2,3,8,13,3,8,23
3,4,9,1,4,9,24
4,5,10,2,6,10,25


In [417]:
# suffixes: custom labels for overlapping column names (instead of the default _x and _y)
df9 = pd.merge(df1,df2,right_index=True,left_index=True,suffixes=("_left","_right"))
df9

Unnamed: 0,a_left,b_left,c,a_right,b_right,d
0,1,6,11,1,6,21
1,2,7,12,2,7,22
2,3,8,13,3,8,23
3,4,9,1,4,9,24
4,5,10,2,6,10,25
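
When the key columns have different names in the two frames, `left_on` and `right_on` from the parameter table take the place of `on`. The following sketch uses invented `orders` and `customers` frames; `validate="many_to_one"` is included only to illustrate the check and raises an error if a customer id appears more than once on the right.

```python
import pandas as pd

# Hypothetical tables whose key columns are named differently.
orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": [10, 20, 10]})
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["Ann", "Bob"]})

joined = pd.merge(
    orders,
    customers,
    left_on="cust",            # key column in the left frame
    right_on="customer_id",    # key column in the right frame
    how="left",                # keep every order
    validate="many_to_one",    # many orders may point at one customer
)

print(joined)
```

Both key columns are kept in the result because their names differ.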


# concat function

concat() – Concatenate DataFrames or Series
The pd.concat() function is used to concatenate (combine) two or more DataFrames or Series along a particular axis (rows or columns).



### 🔗 `concat()` Parameters

| **Parameter**        | **Type**        | **Default**   | **Description** |
|----------------------|-----------------|---------------|------------------|
| `objs`               | list or tuple   | required      | List of Series/DataFrames to concatenate |
| `axis`               | int or str      | `0`           | Axis to concatenate along: `0` = rows (vertical), `1` = columns (horizontal) |
| `join`               | str             | `'outer'`     | How to handle labels on the other axis: `'outer'` (union), `'inner'` (intersection) |
| `ignore_index`       | bool            | `False`       | If `True`, discard the original index along the concatenation axis and relabel it 0, 1, 2, ... |
| `keys`               | sequence        | `None`        | Create a hierarchical index using these keys |
| `levels`             | list of levels  | `None`        | Specific levels for hierarchical index if `keys` is used |
| `names`              | list of str     | `None`        | Names for hierarchical levels if `keys` is used |
| `verify_integrity`   | bool            | `False`       | Check for duplicate indices and raise error if found |
| `sort`               | bool            | `False`       | Sort non-concatenation axis if it is not aligned |
| `copy`               | bool            | `True`        | If `False`, avoid copying data unnecessarily |
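
A few of the parameters above change the shape or labelling of the result in ways that are easy to miss. The sketch below uses two small made-up frames (`df_a`, `df_b`, named differently from the notebook's variables on purpose) to show `ignore_index`, `join`, `axis` and `keys`.

```python
import pandas as pd

# Hypothetical frames sharing only the column "a".
df_a = pd.DataFrame({"a": [1, 2], "c": [11, 12]})
df_b = pd.DataFrame({"a": [3, 4], "d": [21, 22]})

# Stack rows and renumber the index 0..3 instead of repeating 0, 1, 0, 1.
stacked = pd.concat([df_a, df_b], ignore_index=True)

# join="inner" keeps only the columns present in every input.
shared = pd.concat([df_a, df_b], join="inner", ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([df_a, df_b], axis=1)

# keys adds an outer index level recording which input each row came from.
labelled = pd.concat([df_a, df_b], keys=["first", "second"])

print(stacked)
print(shared)
print(side_by_side)
print(labelled.loc["second"])
```

With the default `join="outer"`, columns missing from one input are filled with NaN, which is why integer columns can turn into floats in the stacked result.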


In [418]:
#concat
import pandas as pd
df1 = pd.DataFrame({
    "a":[1,2,3,4,5],
    "b":[6,7,8,9,10],
    "c":[11,12,13,1,2]})
df1


Unnamed: 0,a,b,c
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,1
4,5,10,2


In [419]:
df2 = pd.DataFrame({
    "a":[1,2,3,4,6],
    "b":[6,7,8,9,10],
    "d":[21,22,23,24,25]})
df2

Unnamed: 0,a,b,d
0,1,6,21
1,2,7,22
2,3,8,23
3,4,9,24
4,6,10,25


In [420]:
df3 = pd.concat([df1,df2])
df3

Unnamed: 0,a,b,c,d
0,1,6,11.0,
1,2,7,12.0,
2,3,8,13.0,
3,4,9,1.0,
4,5,10,2.0,
0,1,6,,21.0
1,2,7,,22.0
2,3,8,,23.0
3,4,9,,24.0
4,6,10,,25.0


In [421]:
#using Series
df4 = pd.Series([1,2,3,4,5])
df5 = pd.Series([6,7,8,9,10])
df6 = pd.concat([df4,df5])
df6

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5
0,6
1,7
2,8
3,9
4,10
