**PANDAS**


* Pandas is a Python library used for working with data sets.
* It has functions for analyzing, cleaning, exploring, and manipulating data.
* Pandas allows us to analyze big data and make conclusions based on statistical theories.
* Pandas can clean messy data sets, and make them readable and relevant.
* Relevant data is very important in data science.
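* In pandas, a one-dimensional structure is called a Series and a two-dimensional structure is called a DataFrame. Older releases also had a three-dimensional Panel, but it has been removed from current pandas versions.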

## **Classes in pandas**



## 1. **Series**



* A Pandas Series is like a column in a table.
* A one-dimensional array-like object that can hold data of any type (e.g., integers, strings, floats).

In [None]:
import pandas as pd

data = ['abhay','rahul','ravi']
sr = pd.Series(data)
print(sr)



0    abhay
1    rahul
2     ravi
dtype: object


In [None]:
import pandas as pd

data1 = [1,2,3,4,5]
sr1 = pd.Series(data1)
sr1

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [None]:
#  Creates a NumPy array with elements 'abhay', 'rahul', and 'ravi'
# Converts the NumPy array into a pandas Series object. Each element gets an index (0, 1, 2) by default.

import numpy as np
import pandas as pd

data = np.array(['abhay','rahul','ravi'])
sr2 = pd.Series(data)
sr2

0    abhay
1    rahul
2     ravi
dtype: object


In [None]:
# Custom indices [100, 101, 102] replace the default indices [0, 1, 2].

import numpy as np
import pandas as pd

data = np.array(['abhay','rahul','ravi'])
sr2 = pd.Series(data,index=[100,101,102])   # custom index
print(sr2)

100    abhay
101    rahul
102     ravi
dtype: object


**SERIES DATA INDEXING AND SLICING**

      INDEXING : You can access individual elements of a Series using their index labels.

In [None]:
# INDEXING

import numpy as np
import pandas as pd

data = np.array(['abhay','rahul','ravi'])
sr2 = pd.Series(data,index=[100,101,102])


print(sr2[100])     # getting the value stored under the index label 100



abhay


      SLICING : You can retrieve multiple elements using slicing

In [None]:
# SLICING

import numpy as np
import pandas as pd

data = np.array(['abhay','rahul','ravi'])
sr2 = pd.Series(data,index=[100,101,102])


sr2[0:3]     # an integer slice inside [] is positional here, so this returns the first three elements

100    abhay
101    rahul
102     ravi
dtype: object
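
The difference between label-based and position-based slicing can be sketched as follows. This is a minimal illustration reusing the same names and custom index as above; the variable names `by_label` and `by_position` are chosen here for the example only.

```python
import pandas as pd

sr = pd.Series(['abhay', 'rahul', 'ravi'], index=[100, 101, 102])

# .loc slices by index label and INCLUDES the end label
by_label = sr.loc[100:102]

# .iloc slices by integer position and EXCLUDES the end position
by_position = sr.iloc[0:2]

print(by_label.tolist())
print(by_position.tolist())
```

Using `.loc` or `.iloc` explicitly avoids any doubt about whether a number is meant as a label or as a position.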


## 2. **DATA FRAME**


* A DataFrame in pandas is a two-dimensional, tabular data structure with labeled rows and columns.
* It is one of the most widely used data structures in pandas for handling structured data, similar to a table in databases or Excel.

In [None]:
import pandas as pd

data = [[1,'abhay',30000],[2,'ramu',40000],[3,'kevin',50000]]
df = pd.DataFrame(data)
df

Unnamed: 0,0,1,2
0,1,abhay,30000
1,2,ramu,40000
2,3,kevin,50000


In [None]:
# create a DataFrame with custom row labels (index) and column names

import pandas as pd

data = [[1,'abhay',30000],[2,'ramu',40000],[3,'kevin',50000]]
df = pd.DataFrame(data,index=[100,101,102],columns=['id','name','salary'])
df

Unnamed: 0,id,name,salary
100,1,abhay,30000
101,2,ramu,40000
102,3,kevin,50000


In [None]:
# student id, name, biology mark, physics mark

import pandas as pd
data = [[1,'abhay',70,80],[2,'teju',77,89],[3,'aswanth',89,87]]
df = pd.DataFrame(data,columns=['id','name','biology mark','physics mark'])
df

Unnamed: 0,id,name,biology mark,physics mark
0,1,abhay,70,80
1,2,teju,77,89
2,3,aswanth,89,87


In [None]:
# student id, name, biology mark, physics mark built from a dictionary:   key = column name, value = list of column values

import pandas as pd
data = {'id':[1,2,3],'name':['abhay','teju','aswanth'],'biology mark':[70,77,89],'physics mark':[80,89,87]}
df = pd.DataFrame(data)
df

Unnamed: 0,id,name,biology mark,physics mark
0,1,abhay,70,80
1,2,teju,77,89
2,3,aswanth,89,87


**DATA INSPECTION METHODS**   (data.method())

      1. head()     - Displays the first n rows (default is 5).
      2. tail()     - Displays the last n rows (default is 5).
      3. info()     - Provides a summary of the DataFrame, including data types, non-null values, and memory usage.
      4. describe() - Generates summary statistics for numerical data.
      5. isnull()   - Checks for missing values in the data.
      6. notnull()  - Opposite of .isnull().



In [None]:
df.head(2)

Unnamed: 0,id,name,biology mark,physics mark
0,1,abhay,70,80
1,2,teju,77,89


In [None]:
df.tail(2)

Unnamed: 0,id,name,biology mark,physics mark
1,2,teju,77,89
2,3,aswanth,89,87


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            3 non-null      int64 
 1   name          3 non-null      object
 2   biology mark  3 non-null      int64 
 3   physics mark  3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


In [None]:
df.describe()

Unnamed: 0,id,biology mark,physics mark
count,3.0,3.0,3.0
mean,2.0,78.666667,85.333333
std,1.0,9.609024,4.725816
min,1.0,70.0,80.0
25%,1.5,73.5,83.5
50%,2.0,77.0,87.0
75%,2.5,83.0,88.0
max,3.0,89.0,89.0


In [None]:
df.isnull()

Unnamed: 0,id,name,biology mark,physics mark
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False


In [None]:
df.notnull()

Unnamed: 0,id,name,biology mark,physics mark
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
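
The table above has no missing values, so `isnull()` returns only `False`. A small sketch with deliberately missing marks may show the method more clearly; the data and the names `marks`, `missing_per_column` and `rows_with_missing` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical marks table with two missing entries (np.nan)
marks = pd.DataFrame({
    'name': ['abhay', 'teju', 'aswanth'],
    'biology mark': [70, np.nan, 89],
    'physics mark': [80, 89, np.nan],
})

# isnull() gives a True/False table; sum() counts the True values per column
missing_per_column = marks.isnull().sum()

# any(axis=1) flags every row that has at least one missing value
rows_with_missing = marks[marks.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```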


**CLASS ATTRIBUTES AND METHODS**   -- print(data.attribute)

      Attributes are written without parentheses. value_counts() and unique() are methods, so they need parentheses.

      1. shape          - Returns the number of rows and columns.
      2. ndim           - Returns the number of dimensions (1 for Series, 2 for DataFrame).
      3. columns        - Lists column names of a DataFrame.
      4. index          - Provides the index (row labels).
      5. dtypes         - Returns the data type of each column.
      6. values         - Provides the underlying NumPy array of the data.
      7. size           - Returns the total number of elements (rows x columns).
      8. value_counts() - Counts unique values for a column or Series and returns them with their frequency.
      9. unique()       - Returns the unique values of a Series.
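
As a rough sketch of the attribute versus method distinction, the snippet below reads attributes without parentheses and calls methods with them. The small table and the variable names are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['abhay', 'teju', 'teju']})

# Attributes: no parentheses
shape = df.shape
n_dims = df.ndim

# Methods: parentheses required
counts = df['name'].value_counts()
uniques = df['name'].unique()

# Calling an attribute like a method fails, because shape is a plain tuple
try:
    df.shape()
except TypeError as err:
    error_message = str(err)

print(shape, n_dims)
print(counts)
print(uniques)
print(error_message)
```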


In [None]:
#for specific column:

print(df['name'].value_counts())

# for all data

df.value_counts()

name
abhay      1
teju       1
aswanth    1
Name: count, dtype: int64


id  name     biology mark  physics mark
1   abhay    70            80              1
2   teju     77            89              1
3   aswanth  89            87              1
Name: count, dtype: int64


In [None]:
df['name'].unique()

array(['abhay', 'teju', 'aswanth'], dtype=object)

In [None]:
df.shape

(3, 4)

In [None]:
df.values

array([[1, 'abhay', 70, 80],
       [2, 'teju', 77, 89],
       [3, 'aswanth', 89, 87]], dtype=object)

In [None]:
df.size

12

In [None]:
df.ndim

2

In [None]:
df.dtypes

id               int64
name            object
biology mark     int64
physics mark     int64
dtype: object


In [None]:
df.columns

Index(['id', 'name', 'biology mark', 'physics mark'], dtype='object')

**DATA FRAME INDEXING AND SLICING**

* INDEXING
      * [] (Bracket Notation): For columns or slicing rows.

In [None]:
df['name']

0      abhay
1       teju
2    aswanth
Name: name, dtype: object


In [None]:
df[['name','biology mark']]

Unnamed: 0,name,biology mark
0,abhay,70
1,teju,77
2,aswanth,89


* SLICING
      Slicing is used to select subsets of rows and/or columns.

In [None]:
df[1:3]

Unnamed: 0,id,name,biology mark,physics mark
1,2,teju,77,89
2,3,aswanth,89,87


**LOC & ILOC ----- INDEXING AND SLICING**

      * loc[]  : label-based indexing.
      * iloc[] : Integer position-based indexing and slicing.
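
With the default index 0, 1, 2 the labels and positions happen to be the same numbers, which can hide the difference between the two. The sketch below uses string row labels instead; the labels 'a', 'b', 'c' and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['abhay', 'teju', 'aswanth'], 'mark': [70, 77, 89]},
    index=['a', 'b', 'c'],
)

# loc works with labels; a label slice includes its end label
label_rows = df.loc['a':'b']          # rows 'a' and 'b'
cell_by_label = df.loc['b', 'mark']

# iloc works with integer positions; a position slice excludes its end
position_rows = df.iloc[0:1]          # only the first row
cell_by_position = df.iloc[1, 1]

print(label_rows)
print(position_rows)
print(cell_by_label, cell_by_position)
```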

INDEXING

In [None]:
df.loc[0]     #row

id                  1
name            abhay
biology mark       70
physics mark       80
Name: 0, dtype: object


In [None]:
df.loc[1,'name']    #row col

'teju'

In [None]:
df.iloc[0]

id                  1
name            abhay
biology mark       70
physics mark       80
Name: 0, dtype: object


In [None]:
df.iloc[0,2]   #row and col_num

70

SLICING

In [None]:
df.loc[0:2,'name':'physics mark']

Unnamed: 0,name,biology mark,physics mark
0,abhay,70,80
1,teju,77,89
2,aswanth,89,87


In [None]:
df.iloc[0:2,1:4]    # row,col_num

Unnamed: 0,name,biology mark,physics mark
0,abhay,70,80
1,teju,77,89


**AT AND IAT ------ (indexing ONLY NOT SLICING)**

* In pandas, .at[] and .iat[] are optimized accessors for single elements in a DataFrame or Series. They are similar to .loc[] and .iloc[], but are specifically designed for fast access to a single value (i.e., one row and one column).

      Difference Between .at[] and .iat[]

      * .at[]: Used for label-based indexing (similar to .loc[]), but for single values.
      * .iat[]: Used for integer position-based indexing (similar to .iloc[]), but for single values.

These are faster than .loc[] and .iloc[] because they are designed to retrieve only one value and do not require slicing or other operations.
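
Besides reading a value, the same accessors can be used to set one. A minimal sketch, assuming a small two-row table invented for this example:

```python
import pandas as pd

df = pd.DataFrame({'name': ['abhay', 'teju'], 'mark': [70, 77]})

# .at uses the row label and the column name
df.at[0, 'mark'] = 75

# .iat uses the row position and the column position
df.iat[1, 1] = 80

updated_marks = df['mark'].tolist()
print(df)
```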

INDEXING

In [None]:
df.at[0,'name']

'abhay'

In [None]:
df.iat[1,1]

'teju'

## QUESTION

In [None]:
# QUESTION
# 1. ADD 10 MARKS TO PHYSICS FOR ALL STUDENTS

data = {'id':[1,2,3],'name':['abhay','teju','aswanth'],'biology mark':[70,77,89],'physics mark':[80,89,87]}
df = pd.DataFrame(data)
df

df['physics mark'] += 10
print(df['physics mark'])

0    90
1    99
2    97
Name: physics mark, dtype: int64


In [None]:
# QUESTION
# 2. ADD 2 columns to the table: maths mark and social mark

df['maths mark'] = [70,77,89]
df['social mark'] = [80,89,87]
print(df)

   id     name  biology mark  physics mark  maths mark  social mark
0   1    abhay            70            90          70           80
1   2     teju            77            99          77           89
2   3  aswanth            89            97          89           87


## USING MULTIPLE DATAFRAME METHODS AND ATTRIBUTES

In [None]:
import pandas as pd

data = [[1,'abhay','kozhikode',98],[2,'ramu','malapuram',89],[3,'kevin','kannur',78],[4,'ajmal','palakkad',54],[5,'rajul','kozhikode',68]]
df = pd.DataFrame(data,columns=['id','name','place','mark'])
print(df)


print('------------------------------------------')

print(df.iloc[1])

print('------------------------------------------')

print(df.loc[0:2])

print('------------------------------------------')

print(df.iloc[0:2,0:2])

print('------------------------------------------')

print(df.loc[0:2,'name':'place'])

print('------------------------------------------')

print(df.iat[1,2])

print('------------------------------------------')

print(df.at[1,'name'])

print('------------------------------------------')

df['mark'] += 10     # bracket notation is the safer way to assign to a column
print(df)

print('------------------------------------------')

x=df[df['mark']>80]
print(x)

print('------------------------------------------')

x=df[(df['mark']>80) & (df['place']=='kozhikode')]
print(x)

print('------------------------------------------')

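df.rename(columns={'mark':'chemistry mark'},inplace=True)    # inplace=True renames the column in the same DataFrame
print(df)

# without inplace, rename() returns a new DataFrame that can be assigned to another variable
# ('mark' was already renamed above, so df1 is simply a copy of df here)
df1=df.rename(columns={'mark':'chemistry mark'})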
print(df1)

print('------------------------------------------')

# add a new row to the end of the table
df.loc[len(df)] = [6,'john','palakkad',99]
print(df)

print('------------------------------------------')

# update values in the table
df['place']=df['place'].replace({'kozhikode':'calicut','palakkad':'coimbatore'})   # replace() returns a new Series, so assign it back to the column
print(df)

print('------------------------------------------')

# drop a column or a row of data   (axis=0 means rows and is the default, axis=1 means columns)

df1 = df.drop(columns=['id'])  #column
print(df1)

df1 = df.drop(2,axis=0)       #row
print(df1)

df1 = df.drop('place',axis=1)   #column
print(df1)

print('------------------------------------------')

# another way to drop is pop(), which works only for columns and removes the column from df itself
df.pop('id')
print(df)


print('------------------------------------------')

   id   name      place  mark
0   1  abhay  kozhikode    98
1   2   ramu  malapuram    89
2   3  kevin     kannur    78
3   4  ajmal   palakkad    54
4   5  rajul  kozhikode    68
------------------------------------------
id               2
name          ramu
place    malapuram
mark            89
Name: 1, dtype: object
------------------------------------------
   id   name      place  mark
0   1  abhay  kozhikode    98
1   2   ramu  malapuram    89
2   3  kevin     kannur    78
------------------------------------------
   id   name
0   1  abhay
1   2   ramu
------------------------------------------
    name      place
0  abhay  kozhikode
1   ramu  malapuram
2  kevin     kannur
------------------------------------------
malapuram
------------------------------------------
ramu
------------------------------------------
   id   name      place  mark
0   1  abhay  kozhikode   108
1   2   ramu  malapuram    99
2   3  kevin     kannur    88
3   4  ajmal   palakkad    64
4   5  rajul  ko

## AGGREGATE FUNCTIONS

In [None]:
# sum
df.sum()  # sum of every column
df['chemistry mark'].sum()    # sum of a specific column

536

In [None]:
#min
df['chemistry mark'].min()

64

In [None]:
#max
df['chemistry mark'].max()

108

In [None]:
#prod
df['chemistry mark'].prod()

464998330368

In [None]:
#cumsum
df['chemistry mark'].cumsum()

0    108
1    207
2    295
3    359
4    437
5    536
Name: chemistry mark, dtype: int64


In [None]:
#describe
df['chemistry mark'].describe()

count      6.000000
mean      89.333333
std       16.169931
min       64.000000
25%       80.500000
50%       93.500000
75%       99.000000
max      108.000000
Name: chemistry mark, dtype: float64


In [None]:
#mean
df['chemistry mark'].mean()

89.33333333333333

In [None]:
# standard deviation
df['chemistry mark'].std()

16.169930941926335

In [None]:
#median
df['chemistry mark'].median()

93.5

## **apply method in Series and DataFrame classes**

In [None]:
# modify a column using a function

def add(a):
  return a+10

df['chemistry mark']= df['chemistry mark'].apply(add)
print(df)

    name       place  chemistry mark
0  abhay     calicut             118
1   ramu   malapuram             109
2  kevin      kannur              98
3  ajmal  coimbatore              74
4  rajul     calicut              88
5   john  coimbatore             109


In [None]:
df['place'] = df['place'].apply(lambda x: x.upper())

print(df)

    name       place  chemistry mark
0  abhay     CALICUT             118
1   ramu   MALAPURAM             109
2  kevin      KANNUR              98
3  ajmal  COIMBATORE              74
4  rajul     CALICUT              88
5   john  COIMBATORE             109


In [None]:
# 'place' was converted to upper case in the previous cell, so compare against upper-case names
def encode_place(x):
  if x == 'CALICUT':
    return 0
  elif x == 'MALAPURAM':
    return 1
  else:
    return 2

df['place'] = df['place'].apply(encode_place)
print(df)

    name  place  chemistry mark
0  abhay      0             118
1   ramu      1             109
2  kevin      2              98
3  ajmal      2              74
4  rajul      0              88
5   john      2             109
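
The same kind of encoding can also be written with `Series.map()` and a dictionary. This is an alternative technique, not the `apply()` approach used in the cell above, and the names `places`, `codes` and `encoded` are illustrative assumptions.

```python
import pandas as pd

places = pd.Series(['CALICUT', 'MALAPURAM', 'KANNUR', 'CALICUT'])

# Known places get an explicit code
codes = {'CALICUT': 0, 'MALAPURAM': 1}

# map() leaves NaN for values missing from the dictionary;
# fillna(2) assigns the "everything else" code, astype(int) restores integers
encoded = places.map(codes).fillna(2).astype(int)

print(encoded.tolist())
```

Note that the lookup is case-sensitive, so the dictionary keys must match the stored strings exactly.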


## **GROUP BY METHOD**

In [None]:
import pandas as pd

data = [[1,'abhay','kozhikode',98],[2,'ramu','malapuram',89],[3,'kevin','kannur',78],[4,'ajmal','palakkad',54],[5,'rajul','kozhikode',68]]
df = pd.DataFrame(data,columns=['id','name','place','mark'])

df1 = df.groupby('place')
print(df1.count())

print('---------------------------------------')

df1 = df.groupby('place')
print(df1.size())

print('---------------------------------------')

df1 = df.groupby(['place','name'])
print(df1.count())

print('---------------------------------------')

df1 = df.groupby(['place','name'])
print(df1.size())

           id  name  mark
place                    
kannur      1     1     1
kozhikode   2     2     2
malapuram   1     1     1
palakkad    1     1     1
---------------------------------------
place
kannur       1
kozhikode    2
malapuram    1
palakkad     1
dtype: int64
---------------------------------------
                 id  mark
place     name           
kannur    kevin   1     1
kozhikode abhay   1     1
          rajul   1     1
malapuram ramu    1     1
palakkad  ajmal   1     1
---------------------------------------
place      name 
kannur     kevin    1
kozhikode  abhay    1
           rajul    1
malapuram  ramu     1
palakkad   ajmal    1
dtype: int64
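
Besides `count()` and `size()`, a grouped object can be aggregated with statistics such as the mean. The following sketch reuses the same table as the cell above; the names `mean_mark` and `summary` are chosen here for illustration.

```python
import pandas as pd

data = [[1, 'abhay', 'kozhikode', 98], [2, 'ramu', 'malapuram', 89],
        [3, 'kevin', 'kannur', 78], [4, 'ajmal', 'palakkad', 54],
        [5, 'rajul', 'kozhikode', 68]]
df = pd.DataFrame(data, columns=['id', 'name', 'place', 'mark'])

# Average mark per place
mean_mark = df.groupby('place')['mark'].mean()

# Several statistics at once with agg()
summary = df.groupby('place')['mark'].agg(['min', 'max', 'mean'])

print(mean_mark)
print(summary)
```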


## **sort values method**

In [None]:
df1 = df.sort_values('mark')
print(df1)

print('---------------------------------------')

df1 = df.sort_values('mark',ascending=False)
print(df1)

print('---------------------------------------')

df.sort_values('mark',inplace=True,ascending=False)   # inplace=True sorts df itself
print(df)

   id   name      place  mark
3   4  ajmal   palakkad    54
4   5  rajul  kozhikode    68
2   3  kevin     kannur    78
1   2   ramu  malapuram    89
0   1  abhay  kozhikode    98
---------------------------------------
   id   name      place  mark
0   1  abhay  kozhikode    98
1   2   ramu  malapuram    89
2   3  kevin     kannur    78
4   5  rajul  kozhikode    68
3   4  ajmal   palakkad    54
---------------------------------------
   id   name      place  mark
0   1  abhay  kozhikode    98
1   2   ramu  malapuram    89
2   3  kevin     kannur    78
4   5  rajul  kozhikode    68
3   4  ajmal   palakkad    54


## **read csv**

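Commonly used read_csv() parameters:

      * sep      - the character that separates the columns in the file (default is ',').
      * header   - the row to use as column names (default is 0, the first row). With header=None the title row is read as an ordinary data row.
      * names    - a list with our own column names.
      * skiprows - the number of rows (or a list of row numbers) to skip at the start of the file.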
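
The effect of these parameters can be tried without a file on disk. The sketch below reads a small in-memory CSV through `io.StringIO`; the sample text and the column names passed to `names` are assumptions made for this example.

```python
import io
import pandas as pd

# A tiny semicolon-separated sample standing in for a real file
csv_text = (
    "MyUnknownColumn;Name;subject_id;Marks_scored\n"
    "1;Alex;sub1;98\n"
    "2;Amy;sub2;90\n"
)

# Default: the first row becomes the column names
with_header = pd.read_csv(io.StringIO(csv_text), sep=';')

# header=None: the title row is kept as a normal data row
no_header = pd.read_csv(io.StringIO(csv_text), sep=';', header=None)

# names + skiprows=1: drop the original title row and use our own names
renamed = pd.read_csv(
    io.StringIO(csv_text), sep=';',
    names=['id', 'name', 'subject', 'mark'], skiprows=1,
)

print(with_header)
print(no_header)
print(renamed)
```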

In [None]:
# read data1.csv from the sample data folder

import pandas as pd
df = pd.read_csv('/content/sample_data/data1.csv',sep=';')
df


Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored
0,1,Alex,sub1,98
1,2,Amy,sub2,90
2,3,Allen,sub4,87
3,4,Alice,sub6,69
4,5,Ayoung,sub5,78
5,6,sarang,sub6,87


In [None]:
# give custom column names when reading; without skiprows the file's own title row is read as a data row

import pandas as pd
df = pd.read_csv('/content/sample_data/data1.csv',sep=';',names=['id','name','mark'])
df

Unnamed: 0,id,name,mark
MyUnknownColumn,Name,subject_id,Marks_scored
1,Alex,sub1,98
2,Amy,sub2,90
3,Allen,sub4,87
4,Alice,sub6,69
5,Ayoung,sub5,78
6,sarang,sub6,87


In [None]:
# give custom column names when reading and skip the file's title row

import pandas as pd
df = pd.read_csv('/content/sample_data/data1.csv',sep=';',names=['id','name','mark'],skiprows=1)
df


Unnamed: 0,id,name,mark
1,Alex,sub1,98
2,Amy,sub2,90
3,Allen,sub4,87
4,Alice,sub6,69
5,Ayoung,sub5,78
6,sarang,sub6,87


## QUESTION: read the employee CSV file

In [None]:
# read the employee CSV file

import pandas as pd
df = pd.read_csv('/content/sample_data/employe.csv',sep=';')
df

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1
...,...,...,...,...,...,...,...,...,...
4648,Bachelors,2013,Bangalore,3,26,Female,No,4,0
4649,Masters,2013,Pune,2,37,Male,No,2,1
4650,Masters,2018,New Delhi,3,27,Male,No,5,1
4651,Bachelors,2012,Bangalore,3,30,Male,Yes,2,0


In [None]:
df.head(2)

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1


In [None]:
df.tail(2)

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
4651,Bachelors,2012,Bangalore,3,30,Male,Yes,2,0
4652,Bachelors,2015,Bangalore,3,33,Male,Yes,4,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB


In [None]:
df['Education'].value_counts()

Education
Bachelors    3601
Masters       873
PHD           179
Name: count, dtype: int64


In [None]:
df['Education'].unique()

array(['Bachelors', 'Masters', 'PHD'], dtype=object)

## **JOIN METHODS**     (joining multiple DataFrames)

In [1]:
import pandas as pd

df = pd.read_csv('/content/sample_data/data1.csv',sep=';')
df

Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored
0,1,Alex,sub1,98
1,2,Amy,sub2,90
2,3,Allen,sub4,87
3,4,Alice,sub6,69
4,5,Ayoung,sub5,78
5,6,sarang,sub6,87


In [2]:
df1 = pd.read_csv('/content/sample_data/data2.csv',sep=';')
df1

Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored
0,1,Billy,sub2,89
1,2,Brian,sub4,80
2,3,Bran,sub3,79
3,4,Bryce,sub6,97
4,5,Betty,sub5,88


1. **CONCAT METHOD** :

* The concat() method in pandas is used to combine multiple DataFrames or Series into a single one along a particular axis (rows or columns).

In [4]:
data=pd.concat([df,df1])
data

Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored
0,1,Alex,sub1,98
1,2,Amy,sub2,90
2,3,Allen,sub4,87
3,4,Alice,sub6,69
4,5,Ayoung,sub5,78
5,6,sarang,sub6,87
0,1,Billy,sub2,89
1,2,Brian,sub4,80
2,3,Bran,sub3,79
3,4,Bryce,sub6,97
4,5,Betty,sub5,88


In [6]:
# concatenate along columns (side by side)

data=pd.concat([df,df1],axis=1)
data

Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored,MyUnknownColumn.1,Name.1,subject_id.1,Marks_scored.1
0,1,Alex,sub1,98,1.0,Billy,sub2,89.0
1,2,Amy,sub2,90,2.0,Brian,sub4,80.0
2,3,Allen,sub4,87,3.0,Bran,sub3,79.0
3,4,Alice,sub6,69,4.0,Bryce,sub6,97.0
4,5,Ayoung,sub5,78,5.0,Betty,sub5,88.0
5,6,sarang,sub6,87,,,,


In [8]:
# concatenate along rows (this is the default, axis=0)

data2=pd.concat([df,df1],axis=0)
data2

Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored
0,1,Alex,sub1,98
1,2,Amy,sub2,90
2,3,Allen,sub4,87
3,4,Alice,sub6,69
4,5,Ayoung,sub5,78
5,6,sarang,sub6,87
0,1,Billy,sub2,89
1,2,Brian,sub4,80
2,3,Bran,sub3,79
3,4,Bryce,sub6,97
4,5,Betty,sub5,88


In [10]:
#ignore_index :  used to reset the index of the resulting DataFrame after concatenation.
# ignore_index=True when you don't need to keep the original indices and want a clean, sequential index in the combined DataFrame.

data=pd.concat([df,df1],axis=0,ignore_index=True)
data

Unnamed: 0,MyUnknownColumn,Name,subject_id,Marks_scored
0,1,Alex,sub1,98
1,2,Amy,sub2,90
2,3,Allen,sub4,87
3,4,Alice,sub6,69
4,5,Ayoung,sub5,78
5,6,sarang,sub6,87
6,1,Billy,sub2,89
7,2,Brian,sub4,80
8,3,Bran,sub3,79
9,4,Bryce,sub6,97
10,5,Betty,sub5,88


2. **MERGE METHOD** :

* The merge() method in pandas is used to combine two DataFrames based on common columns or indices. It is similar to SQL JOIN operations, such as INNER JOIN, LEFT JOIN, RIGHT JOIN, and OUTER JOIN.
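
A compact way to see which rows each join type keeps is to merge two very small tables. The sketch below uses invented `left` and `right` tables and the `indicator=True` option, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'subject_id': ['sub1', 'sub2', 'sub4'],
                     'Name': ['Alex', 'Amy', 'Allen']})
right = pd.DataFrame({'subject_id': ['sub2', 'sub3', 'sub4'],
                      'Name': ['Billy', 'Bran', 'Brian']})

# inner: only subject_id values present in both tables
inner = pd.merge(left, right, on='subject_id', how='inner')

# outer: every subject_id from either table; _merge shows the origin of each row
outer = pd.merge(left, right, on='subject_id', how='outer', indicator=True)

print(inner)
print(outer)
```

Overlapping column names other than the key (here `Name`) get the suffixes `_x` and `_y` by default.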

In [11]:
data=pd.merge(df,df1,on='subject_id',how='outer')         # outerjoin
data

Unnamed: 0,MyUnknownColumn_x,Name_x,subject_id,Marks_scored_x,MyUnknownColumn_y,Name_y,Marks_scored_y
0,1.0,Alex,sub1,98.0,,,
1,2.0,Amy,sub2,90.0,1.0,Billy,89.0
2,,,sub3,,3.0,Bran,79.0
3,3.0,Allen,sub4,87.0,2.0,Brian,80.0
4,5.0,Ayoung,sub5,78.0,5.0,Betty,88.0
5,4.0,Alice,sub6,69.0,4.0,Bryce,97.0
6,6.0,sarang,sub6,87.0,4.0,Bryce,97.0


In [13]:
data=pd.merge(df,df1,on='subject_id',how='inner')       #innerjoin
data

Unnamed: 0,MyUnknownColumn_x,Name_x,subject_id,Marks_scored_x,MyUnknownColumn_y,Name_y,Marks_scored_y
0,2,Amy,sub2,90,1,Billy,89
1,3,Allen,sub4,87,2,Brian,80
2,4,Alice,sub6,69,4,Bryce,97
3,5,Ayoung,sub5,78,5,Betty,88
4,6,sarang,sub6,87,4,Bryce,97


In [14]:
data=pd.merge(df,df1,on='subject_id',how='left')       #leftjoin
data

Unnamed: 0,MyUnknownColumn_x,Name_x,subject_id,Marks_scored_x,MyUnknownColumn_y,Name_y,Marks_scored_y
0,1,Alex,sub1,98,,,
1,2,Amy,sub2,90,1.0,Billy,89.0
2,3,Allen,sub4,87,2.0,Brian,80.0
3,4,Alice,sub6,69,4.0,Bryce,97.0
4,5,Ayoung,sub5,78,5.0,Betty,88.0
5,6,sarang,sub6,87,4.0,Bryce,97.0


In [15]:
data=pd.merge(df,df1,on='subject_id',how='right')       #rightjoin
data

Unnamed: 0,MyUnknownColumn_x,Name_x,subject_id,Marks_scored_x,MyUnknownColumn_y,Name_y,Marks_scored_y
0,2.0,Amy,sub2,90.0,1,Billy,89
1,3.0,Allen,sub4,87.0,2,Brian,80
2,,,sub3,,3,Bran,79
3,4.0,Alice,sub6,69.0,4,Bryce,97
4,6.0,sarang,sub6,87.0,4,Bryce,97
5,5.0,Ayoung,sub5,78.0,5,Betty,88


In [16]:
# join() combines DataFrames on their index; lsuffix and rsuffix are added to overlapping column names from the left and right DataFrame

data=df.join(df1,lsuffix='_left',rsuffix='_right')
data

Unnamed: 0,MyUnknownColumn_left,Name_left,subject_id_left,Marks_scored_left,MyUnknownColumn_right,Name_right,subject_id_right,Marks_scored_right
0,1,Alex,sub1,98,1.0,Billy,sub2,89.0
1,2,Amy,sub2,90,2.0,Brian,sub4,80.0
2,3,Allen,sub4,87,3.0,Bran,sub3,79.0
3,4,Alice,sub6,69,4.0,Bryce,sub6,97.0
4,5,Ayoung,sub5,78,5.0,Betty,sub5,88.0
5,6,sarang,sub6,87,,,,


3. **COMBINE METHOD**

* The combine() method of a pandas DataFrame combines two DataFrames column by column using a function that you supply. The function receives the matching column from each DataFrame as a pair of Series and returns the Series to keep in the result.
      A common use is to compare the two columns and return the larger one. The example in the following cells instead adds the matching columns together.
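
The "return the larger column" idea can be sketched as below. The two small tables and the helper `take_larger` are hypothetical and only illustrate how the function passed to `combine()` receives one column from each DataFrame.

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 9], 'y': [10, 20]})
df_b = pd.DataFrame({'x': [5, 6], 'y': [1, 2]})

# Hypothetical helper: keep whichever whole column has the larger sum
def take_larger(col_a, col_b):
    return col_a if col_a.sum() > col_b.sum() else col_b

larger = df_a.combine(df_b, take_larger)

# Variation: keep the larger value element by element
elementwise_max = df_a.combine(df_b, lambda a, b: a.where(a > b, b))

print(larger)
print(elementwise_max)
```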

In [19]:
import pandas as pd

data=([[1001,'nandesawri',35000,3434,65767],[1002,'molt',38000,45255,4525],[1003,'rena',37000,24525,5245]])
df=pd.DataFrame(data,columns=['a','b','c','d','e'])
df

Unnamed: 0,a,b,c,d,e
0,1001,nandesawri,35000,3434,65767
1,1002,molt,38000,45255,4525
2,1003,rena,37000,24525,5245


In [18]:
data2=([[1001,'nandesawri',35000,3434,65767],[1002,'molt',38000,45255,4525],[1003,'rena',37000,24525,5245]])
df2=pd.DataFrame(data2,columns=['a','b','c','d','e'])
df2

Unnamed: 0,a,b,c,d,e
0,1001,nandesawri,35000,3434,65767
1,1002,molt,38000,45255,4525
2,1003,rena,37000,24525,5245


In [20]:
res=df.combine(df2,lambda x,y:x+y)
res

Unnamed: 0,a,b,c,d,e
0,2002,nandesawrinandesawri,70000,6868,131534
1,2004,moltmolt,76000,90510,9050
2,2006,renarena,74000,49050,10490
