## Table of Contents

1. **[Pandas](#pandas)**

   <img align="center" src="https://habrastorage.org/files/10c/15f/f3d/10c15ff3dcb14abdbabdac53fed6d825.jpg"  width=50% />


2. **[Data Structures](#structures)**
    
3. **[Pandas Series](#series)**
    - 3.1 - [Creating a Series](#creatingS)
    - 3.2 - [Manipulating Series](#manipulatingS)

4. **[Pandas Dataframes](#dataframes)**
    - 4.1 - [Creating Dataframes](#creatingDF)
    - Dataframe Methods
    - 4.2 - [Manipulating Dataframes](#manipulatingDF)

5. **[Merge Dataframe](#reading_data)**

6. **[Groupby Method](#Groupby)**

<a id="pandas"> </a>
## 1. Pandas
#### Introduction to Pandas

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> Pandas contains data structures and data manipulation tools designed for data cleaning and analysis.
<br><br>
                        While Pandas adopts many coding idioms from NumPy, the biggest difference is that Pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.<br><br>
                         The name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
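
The contrast between homogeneous NumPy arrays and heterogeneous pandas data can be seen in a small sketch. The names `mixed_array` and `mixed_frame` and the sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing numbers and text
# forces every element to be stored as a string.
mixed_array = np.array([1, 2.5, "three"])

# A DataFrame keeps one dtype per column, so numeric and text
# columns can sit side by side.
mixed_frame = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [1.5, 2.5, 3.5],
    "label": ["one", "two", "three"],
})

print(mixed_array.dtype.kind)   # 'U' marks a Unicode string array
print(mixed_frame.dtypes)       # one dtype per column
```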

**How to install and import pandas**<br>
1. Install it from a notebook cell with:<br>
`!pip install pandas`<br>
2. Then import it as:<br>
`import pandas as pd`

In [2]:
pip install pandas



In [1]:
import pandas as pd
import numpy as np

<a id="structures"> </a>
## 2. Data Structures
#### Introduction to Data Structures

Pandas has two main data structures:<br>
1. A Series is a one-dimensional labeled array that can hold data of any type (integer, string, boolean, float, Python objects, and so on). Its axis labels are collectively called an index.<br>
2. A DataFrame is a two-dimensional labeled data structure with rows and columns. Each column can hold a different data type.
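
As a rough illustration of the two structures, the sketch below builds one of each. The names `temperatures` and `weather` and their values are hypothetical.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
temperatures = pd.Series([21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])

# A DataFrame: two-dimensional, with a row index and named columns,
# where each column may have its own dtype.
weather = pd.DataFrame(
    {"temperature": [21.5, 19.0, 23.2], "rainy": [False, True, False]},
    index=["Mon", "Tue", "Wed"],
)

print(temperatures.ndim, list(temperatures.index))
print(weather.ndim, weather.shape, list(weather.columns))
```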

<a id="series"> </a>
## 3. Pandas Series
#### Introduction to Pandas Series and Creating Series

A pandas Series is a one-dimensional labeled array capable of holding any data type.
Each item in the Series is assigned an index label. By default, the labels run from 0 to N, where N is the length of the Series minus one.

<a id="creatingS"> </a>
### 3.1 Creating a Series

**1. To create a numeric series** 

In [8]:
# create a numeric series
numbers = range(1,100,5)
print(numbers)
a = pd.Series(numbers)
type(a)


range(1, 100, 5)


pandas.core.series.Series

Displaying the series itself (for example with `print(a)`) also shows its data type, which is `int64` here.

And note that by default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.
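
A short sketch of the default labels, reusing the `range(1, 100, 5)` series from the cell above: the index runs from 0 up to the length minus one.

```python
import pandas as pd

a = pd.Series(range(1, 100, 5))

print(len(a))                     # number of items in the series
print(a.index[0], a.index[-1])    # first and last default labels
print(a.dtype)                    # data type of the values
```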

In [9]:
# create a Series from a list
s = pd.Series([345, 'London', 34.5, -34.45, 'Happy Birthday'])
s

0               345
1            London
2              34.5
3            -34.45
4    Happy Birthday
dtype: object

In [11]:
marks = [60, 89, 74, 86]

subject = ["Maths", "Science", "English" , "Social Science"]

mrk= pd.Series(marks,index=subject) 
print(mrk)

Maths             60
Science           89
English           74
Social Science    86
dtype: int64


In [9]:
#mrk["Maths"]

In [14]:
# create with dictionary
data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}

pd.Series(data)

Maths             60
Science           89
English           76
Social Science    86
dtype: int64

In [13]:
# with missing values
subjects = ["Maths", "Science", "English","Art and Craft" , "Social Science"]

marks_series = pd.Series(data, index = subjects)

print(marks_series)

Maths             60.0
Science           89.0
English           76.0
Art and Craft      NaN
Social Science    86.0
dtype: float64

<a id="manipulatingS"> </a>
### 3.2 Manipulating Series 
#### Manipulating series

###### Check for null values

In [12]:
# count the missing (null) values
marks_series.isnull().sum()

In [19]:
marks_series.notnull()

Maths              True
Science            True
English            True
Art and Craft     False
Social Science     True
dtype: bool

**To find the subjects in which the marks are more than 75**

In [20]:
marks_series[marks_series > 75]

Science           89.0
English           76.0
Social Science    86.0
dtype: float64

**To assign 68 marks to 'Art and Craft'**

In [21]:
marks_series

Maths             60.0
Science           89.0
English           76.0
Art and Craft      NaN
Social Science    86.0
dtype: float64

In [22]:
marks_series["Art and Craft"] = 68
marks_series

Maths             60.0
Science           89.0
English           76.0
Art and Craft     68.0
Social Science    86.0
dtype: float64

In [24]:
marks_series[marks_series<70] = 65
marks_series

Maths             65.0
Science           89.0
English           76.0
Art and Craft     65.0
Social Science    86.0
dtype: float64

In [25]:
# check whether the Maths marks equal 73
marks_series.Maths == 73

False

In [26]:
# or you may use

marks_series["Maths"] == 73

False

**Sorting a numeric series**

In [27]:
# create a pandas series
import numpy as np
values = pd.Series([23, 45, np.nan, 41, 23, 34, 55, np.nan, 34, 20])
values

0    23.0
1    45.0
2     NaN
3    41.0
4    23.0
5    34.0
6    55.0
7     NaN
8    34.0
9    20.0
dtype: float64

In [28]:
# ascending order
values.sort_values(ascending = True)

9    20.0
0    23.0
4    23.0
5    34.0
8    34.0
3    41.0
1    45.0
6    55.0
2     NaN
7     NaN
dtype: float64

In [29]:
# descending order
values.sort_values(ascending = False)

6    55.0
1    45.0
3    41.0
5    34.0
8    34.0
0    23.0
4    23.0
9    20.0
2     NaN
7     NaN
dtype: float64

In [30]:
# create a pandas series of strings (sorted in the next cells)
string_values = pd.Series(["a", "j", "d", "f", "t", "a"])

string_values

0    a
1    j
2    d
3    f
4    t
5    a
dtype: object

In [34]:
# ascending order
#reset_index()
string_values.sort_values(ascending = True)

0    a
5    a
2    d
3    f
1    j
4    t
dtype: object

In [35]:
# descending order
string_values.sort_values(ascending = False)

4    t
1    j
3    f
2    d
0    a
5    a
dtype: object

<a id="dataframes"> </a>
## 4. Pandas Dataframes
#### Introduction to Dataframes and Creating Dataframes

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> A Dataframe is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on). <br><br>
                        The Dataframe has both a row and column index; it can be thought of as a dict of Series all sharing the same index. In a dataframe, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. 
<br><br>
                        While a Dataframe is physically two-dimensional, it can be used to represent higher dimensional data in a tabular format using hierarchical indexing.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
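
The two ideas in the box above, a dataframe as a dict of Series sharing one index and hierarchical indexing for higher-dimensional data, can be sketched as follows. The names `report` and `by_term` and all values are illustrative assumptions.

```python
import pandas as pd

marks = pd.Series([45, 65], index=["Maths", "History"])
cgpa = pd.Series([2.5, 3.0], index=["Maths", "History"])

# Each column of the frame is a Series, and all columns share the row index.
report = pd.DataFrame({"Marks": marks, "CGPA": cgpa})
print(type(report["Marks"]).__name__)
print(report["Marks"].index.equals(report["CGPA"].index))

# Hierarchical (multi-level) index: a third dimension, the term,
# is folded into the row labels of a two-dimensional frame.
by_term = pd.DataFrame(
    {"Marks": [65, 70, 45, 50]},
    index=pd.MultiIndex.from_product(
        [["History", "Maths"], ["Term 1", "Term 2"]],
        names=["Subject", "Term"],
    ),
)
print(by_term.loc["Maths"])                          # all terms for one subject
print(by_term.loc[("History", "Term 2"), "Marks"])   # one subject, one term
```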

<a id="creatingDF"> </a>
### 4.1 Creating Dataframes

**1. To create a dataframe from a dictionary**

In [36]:
data = {'Subject': ['Maths', 'History', 'Science', 'English', 'Geography', 'Art'],
        'Marks': (45, 65, 78, 65, 80, 78),
        'CGPA': [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]}

df = pd.DataFrame(data)
print(df)

     Subject  Marks  CGPA
0      Maths     45   2.5
1    History     65   3.0
2    Science     78   3.5
3    English     65   2.0
4  Geography     80   4.0
5        Art     78   4.0


**Note:** Like a Series, the resulting dataframe is assigned an index automatically. The 'Marks' values are passed as a tuple; pandas accepts a tuple here just as it accepts a list.

**2. To create a dataframe from Series**

In [15]:
Subject = pd.Series(['Maths', 'History', 'Science', 'English', 'Geography', 'Art'])
Marks = pd.Series([45, 65, 78, 65, 80, 78])
CGPA = pd.Series([2.5, 3.0, 3.5, 2.0, 4.0, 4.0])

In [39]:
# each Series becomes a row; add index=['Subject','Marks','CGPA'] and transpose with .T to get columns instead
pd.DataFrame([Subject,Marks,CGPA])

Unnamed: 0,0,1,2,3,4,5
0,Maths,History,Science,English,Geography,Art
1,45,65,78,65,80,78
2,2.5,3.0,3.5,2.0,4.0,4.0


**3. To create a dataframe from lists**

In [40]:
Subject = ['Maths', 'History', 'Science', 'English', 'Geography', 'Art']
Marks = [45, 65, 78, 65, 80, 78]
CGPA = [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]

In [41]:
d=pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA']).T
d

Unnamed: 0,Subject,Marks,CGPA
0,Maths,45,2.5
1,History,65,3.0
2,Science,78,3.5
3,English,65,2.0
4,Geography,80,4.0
5,Art,78,4.0


In [42]:
print(d[['CGPA','Subject']])

  CGPA    Subject
0  2.5      Maths
1  3.0    History
2  3.5    Science
3  2.0    English
4  4.0  Geography
5  4.0        Art


### To read data from a CSV file

In [43]:
data = pd.read_csv("Data.csv")
data


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [44]:
type(data)

pandas.core.frame.DataFrame

In [45]:
print(data)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [46]:
data.tail()

Unnamed: 0,Country,Age,Salary,Purchased
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [73]:
#data.head()
#data.shape
#data.isnull().sum()
#data.info()
#data.describe()
#data.Country.unique()
#data.Country.nunique()
#type(data.Age)
#type(data['Age'])
#data[["Age","Country"]]  # a list of names selects several columns
#data.Age.sum()  # also .min(), .max(), .value_counts()

<a id="manipulatingDF"> </a>
### 4.2  Manipulating Dataframes 
#### Manipulating the Dataframes

### Add new column and rows

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> CAUTION:<br>
                        1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.<br>
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
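
The caution above can be illustrated with a small sketch. The frame `staff` and its column names are hypothetical; the `count` column is added to show a second pitfall, a column name that collides with an existing DataFrame method.

```python
import pandas as pd

staff = pd.DataFrame({
    "Age": [44, 27],
    "Home Country": ["France", "Spain"],   # contains a space
    "count": [1, 2],                        # clashes with DataFrame.count()
})

print(staff["Age"].tolist())            # bracket access always works
print(staff.Age.tolist())               # attribute access works for simple names
print(staff["Home Country"].tolist())   # only bracket access can reach this column
print(callable(staff.count))            # attribute access finds the method, not the column
print(staff["count"].tolist())          # bracket access finds the column
```

Bracket access is therefore the safer default when column names come from a file you do not control.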

**Adding a new column to the data set**

In [74]:
data["New"] = data["Age"] / data["Salary"]

In [75]:
data

Unnamed: 0,Country,Age,Salary,Purchased,New
0,France,44.0,72000.0,No,0.000611
1,Spain,27.0,48000.0,Yes,0.000562
2,Germany,30.0,54000.0,No,0.000556
3,Spain,38.0,61000.0,No,0.000623
4,Germany,40.0,,Yes,
5,France,35.0,58000.0,Yes,0.000603
6,Spain,,52000.0,No,
7,France,48.0,79000.0,Yes,0.000608
8,Germany,50.0,83000.0,No,0.000602
9,France,37.0,67000.0,Yes,0.000552


In [76]:
data['C']=[i for i in range(0,10)]
data

Unnamed: 0,Country,Age,Salary,Purchased,New,C
0,France,44.0,72000.0,No,0.000611,0
1,Spain,27.0,48000.0,Yes,0.000562,1
2,Germany,30.0,54000.0,No,0.000556,2
3,Spain,38.0,61000.0,No,0.000623,3
4,Germany,40.0,,Yes,,4
5,France,35.0,58000.0,Yes,0.000603,5
6,Spain,,52000.0,No,,6
7,France,48.0,79000.0,Yes,0.000608,7
8,Germany,50.0,83000.0,No,0.000602,8
9,France,37.0,67000.0,Yes,0.000552,9


In [77]:
# Insert column 'D' at position 1
data.insert(1, 'D', [i for i in range(10,20)])
data

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,France,10,44.0,72000.0,No,0.000611,0
1,Spain,11,27.0,48000.0,Yes,0.000562,1
2,Germany,12,30.0,54000.0,No,0.000556,2
3,Spain,13,38.0,61000.0,No,0.000623,3
4,Germany,14,40.0,,Yes,,4
5,France,15,35.0,58000.0,Yes,0.000603,5
6,Spain,16,,52000.0,No,,6
7,France,17,48.0,79000.0,Yes,0.000608,7
8,Germany,18,50.0,83000.0,No,0.000602,8
9,France,19,37.0,67000.0,Yes,0.000552,9


In [16]:
data['C']=100

In [84]:
data

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,France,10,44.0,72000.0,No,0.000611,100
1,Spain,11,27.0,48000.0,Yes,0.000562,100
2,Germany,12,30.0,54000.0,No,0.000556,100
3,Spain,13,38.0,61000.0,No,0.000623,100
4,Germany,14,40.0,,Yes,,100
5,France,15,35.0,58000.0,Yes,0.000603,100
6,Spain,16,,52000.0,No,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
8,Germany,18,50.0,83000.0,No,0.000602,100
9,France,19,37.0,67000.0,Yes,0.000552,100


**Adding a new row to the data set**

A new row can be added by assigning to a new index label with `.loc`. Working on a copy made with `copy()` leaves the original dataframe unchanged.

In [17]:
data_copy = data.copy()
# one value per column, in column order (illustrative values, not taken from Data.csv)
data_copy.loc[10] = ['Italy', 20, 45.0, 85000.0, 'No', 45.0 / 85000.0, 100]

In [92]:
data

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,France,10,44.0,72000.0,No,0.000611,100
1,Spain,11,27.0,48000.0,Yes,0.000562,100
2,Germany,12,30.0,54000.0,No,0.000556,100
3,Spain,13,38.0,61000.0,No,0.000623,100
4,Germany,14,40.0,,Yes,,100
5,France,15,35.0,58000.0,Yes,0.000603,100
6,Spain,16,,52000.0,No,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
8,Germany,18,50.0,83000.0,No,0.000602,100
9,France,19,37.0,67000.0,Yes,0.000552,100


In [93]:
data_copy.shape

(11, 7)


### Selecting rows and columns

**Indexing a dataframe using `.iloc`**

`DataFrame.iloc[]` selects rows and columns by integer position, counting from 0.

In [94]:
# select the row at position 2 (the third row)
data.iloc[2]

Country       Germany
D                  12
Age              30.0
Salary        54000.0
Purchased          No
New          0.000556
C                 100
Name: 2, dtype: object

In [95]:
# select multiple rows
data.iloc[[4,7,9]]

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
4,Germany,14,40.0,,Yes,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
9,France,19,37.0,67000.0,Yes,0.000552,100


We use two square brackets since we are passing a list of row numbers to be accessed.

In [96]:
# select the rows at positions 5 to 8 (the end position 9 is excluded)
data.iloc[5:9]

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
5,France,15,35.0,58000.0,Yes,0.000603,100
6,Spain,16,,52000.0,No,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
8,Germany,18,50.0,83000.0,No,0.000602,100


In [100]:
# select all rows and all columns; add a column position after the comma (e.g. 0, 1 or -1) to select one column
data.iloc[:,]

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,France,10,44.0,72000.0,No,0.000611,100
1,Spain,11,27.0,48000.0,Yes,0.000562,100
2,Germany,12,30.0,54000.0,No,0.000556,100
3,Spain,13,38.0,61000.0,No,0.000623,100
4,Germany,14,40.0,,Yes,,100
5,France,15,35.0,58000.0,Yes,0.000603,100
6,Spain,16,,52000.0,No,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
8,Germany,18,50.0,83000.0,No,0.000602,100
9,France,19,37.0,67000.0,Yes,0.000552,100


**Indexing a dataframe using `.loc`**

`DataFrame.loc[]` selects by label: it takes index labels and returns a row (as a Series) or a dataframe if those labels exist in the dataframe. <br>
`DataFrame.loc[row_labels, column_labels]` selects rows and columns by name.
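
A minimal sketch of the difference between label-based and position-based selection, using a hypothetical frame `people` with string row labels so the two cannot be confused:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [44, 27, 30, 38], "Salary": [72000, 48000, 54000, 61000]},
    index=["a", "b", "c", "d"],
)

by_label = people.loc["a":"c", ["Age"]]    # label slice: the end label is included
by_position = people.iloc[0:2, [0]]        # position slice: the end position is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

With the default integer index the labels and positions coincide, which is why `data.loc[0:5]` returns six rows while `data.iloc[0:5]` returns five.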

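#### Select the rows labelled 0 to 5 and the `Age` and `Salary` columns

Unlike position-based slicing, a label slice such as `0:5` includes the end label.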

In [101]:
data.loc[0:5,["Age","Salary"]]

Unnamed: 0,Age,Salary
0,44.0,72000.0
1,27.0,48000.0
2,30.0,54000.0
3,38.0,61000.0
4,40.0,
5,35.0,58000.0


### To check for missing values and imputation

In [105]:
data.isnull()#.sum()

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,True,False,True,False
5,False,False,False,False,False,False,False
6,False,False,True,False,False,True,False
7,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False


In [103]:
#data.dropna(axis=1)

In [104]:
#data.fillna({"Age":5,"Salary":0})

In [111]:
#df.ffill(axis=0)  # forward fill; fillna(method="ffill") is deprecated in recent pandas

In [113]:
#df.fillna(0)

In [102]:
# Fill column A with its mean
#df["A"] = df["A"].fillna(df["A"].mean())
# Fill column B with its median
#df["B"] = df["B"].fillna(df["B"].median())
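
The commented-out calls above can be tried on a small frame. This is only a sketch: the frame `scores`, its columns `A` and `B`, and the values are assumptions chosen so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 8.0],
    "B": [10.0, 20.0, np.nan, 90.0],
})

filled = scores.copy()
filled["A"] = filled["A"].fillna(filled["A"].mean())      # mean of 1, 3 and 8
filled["B"] = filled["B"].fillna(filled["B"].median())    # median of 10, 20 and 90

dropped_rows = scores.dropna()    # keep only the rows without missing values
forward = scores.ffill()          # carry the previous value forward

print(filled)
print(dropped_rows)
print(forward)
```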

In [106]:
data

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,France,10,44.0,72000.0,No,0.000611,100
1,Spain,11,27.0,48000.0,Yes,0.000562,100
2,Germany,12,30.0,54000.0,No,0.000556,100
3,Spain,13,38.0,61000.0,No,0.000623,100
4,Germany,14,40.0,,Yes,,100
5,France,15,35.0,58000.0,Yes,0.000603,100
6,Spain,16,,52000.0,No,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
8,Germany,18,50.0,83000.0,No,0.000602,100
9,France,19,37.0,67000.0,Yes,0.000552,100


In [107]:
data['Age'] = data['Age'].fillna(data['Age'].mean())

In [108]:
data

Unnamed: 0,Country,D,Age,Salary,Purchased,New,C
0,France,10,44.0,72000.0,No,0.000611,100
1,Spain,11,27.0,48000.0,Yes,0.000562,100
2,Germany,12,30.0,54000.0,No,0.000556,100
3,Spain,13,38.0,61000.0,No,0.000623,100
4,Germany,14,40.0,,Yes,,100
5,France,15,35.0,58000.0,Yes,0.000603,100
6,Spain,16,38.777778,52000.0,No,,100
7,France,17,48.0,79000.0,Yes,0.000608,100
8,Germany,18,50.0,83000.0,No,0.000602,100
9,France,19,37.0,67000.0,Yes,0.000552,100


### Delete rows and columns

In [109]:
# Delete columns 'D' and 'C'
df=data.drop(columns=['D','C'])
df

Unnamed: 0,Country,Age,Salary,Purchased,New
0,France,44.0,72000.0,No,0.000611
1,Spain,27.0,48000.0,Yes,0.000562
2,Germany,30.0,54000.0,No,0.000556
3,Spain,38.0,61000.0,No,0.000623
4,Germany,40.0,,Yes,
5,France,35.0,58000.0,Yes,0.000603
6,Spain,38.777778,52000.0,No,
7,France,48.0,79000.0,Yes,0.000608
8,Germany,50.0,83000.0,No,0.000602
9,France,37.0,67000.0,Yes,0.000552


#### Delete rows

In [110]:
# Delete the row with index 9
df = df.drop(index=9)
df

Unnamed: 0,Country,Age,Salary,Purchased,New
0,France,44.0,72000.0,No,0.000611
1,Spain,27.0,48000.0,Yes,0.000562
2,Germany,30.0,54000.0,No,0.000556
3,Spain,38.0,61000.0,No,0.000623
4,Germany,40.0,,Yes,
5,France,35.0,58000.0,Yes,0.000603
6,Spain,38.777778,52000.0,No,
7,France,48.0,79000.0,Yes,0.000608
8,Germany,50.0,83000.0,No,0.000602


#### Update rows

In [112]:
df.loc[1] = [100, 200, 300,444,78]  # Update row with index 1
df

Unnamed: 0,Country,Age,Salary,Purchased,New
0,France,44.0,72000.0,No,0.000611
1,100,200.0,300.0,444,78.0
2,Germany,30.0,54000.0,No,0.000556
3,Spain,38.0,61000.0,No,0.000623
4,Germany,40.0,,Yes,
5,France,35.0,58000.0,Yes,0.000603
6,Spain,38.777778,52000.0,No,
7,France,48.0,79000.0,Yes,0.000608
8,Germany,50.0,83000.0,No,0.000602


In [113]:
df['Country'] = df['Country'].replace(100, 'Pakistan')  # Replace the value 100 with 'Pakistan' in column 'Country'
df

Unnamed: 0,Country,Age,Salary,Purchased,New
0,France,44.0,72000.0,No,0.000611
1,Pakistan,200.0,300.0,444,78.0
2,Germany,30.0,54000.0,No,0.000556
3,Spain,38.0,61000.0,No,0.000623
4,Germany,40.0,,Yes,
5,France,35.0,58000.0,Yes,0.000603
6,Spain,38.777778,52000.0,No,
7,France,48.0,79000.0,Yes,0.000608
8,Germany,50.0,83000.0,No,0.000602


In [114]:
df.iat[1, 2] = 399  # .iat sets one value by position: row 1 (second row), column 2 (third column, 'Salary')
df

Unnamed: 0,Country,Age,Salary,Purchased,New
0,France,44.0,72000.0,No,0.000611
1,Pakistan,200.0,399.0,444,78.0
2,Germany,30.0,54000.0,No,0.000556
3,Spain,38.0,61000.0,No,0.000623
4,Germany,40.0,,Yes,
5,France,35.0,58000.0,Yes,0.000603
6,Spain,38.777778,52000.0,No,
7,France,48.0,79000.0,Yes,0.000608
8,Germany,50.0,83000.0,No,0.000602


**Note:**

`.drop()` returns a new DataFrame unless you use `inplace=True`.
To reset the index after deleting rows, use
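
A minimal sketch of both points in the note, assuming `reset_index(drop=True)` as one standard way to renumber the rows after a deletion. The frame `frame` and its values are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"x": [10, 20, 30, 40]})

trimmed = frame.drop(index=1)                   # frame itself is left untouched
renumbered = trimmed.reset_index(drop=True)     # assumed call: labels restart at 0

print(len(frame), trimmed.index.tolist(), renumbered.index.tolist())
```

Passing `drop=True` discards the old labels; without it they are kept as a new column named `index`.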

#### Drop duplicated data

In [116]:
# Create a DataFrame
d = {
    'Name':['Alisa','Bobby','jodha','jack','raghu','Cathrine',
            'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
    'Company':['Apple','Walmart','Walmart','Intel','Apple','Walmart','Apple','Cognizant','Apple','Apple','Cognizant','Walmart'],
      
       'Salary':[8500,6300,5500,7400,3100,7700,8500,6300,4200,6200,8900,7700]}
 
df = pd.DataFrame(d,columns=['Name','Company','Salary'])
df

Unnamed: 0,Name,Company,Salary
0,Alisa,Apple,8500
1,Bobby,Walmart,6300
2,jodha,Walmart,5500
3,jack,Intel,7400
4,raghu,Apple,3100
5,Cathrine,Walmart,7700
6,Alisa,Apple,8500
7,Bobby,Cognizant,6300
8,kumar,Apple,4200
9,Alisa,Apple,6200
10,Alex,Cognizant,8900
11,Cathrine,Walmart,7700


**Duplicate rows based on all columns**

In [117]:
# Select duplicate rows except first occurrence based on all columns
duplicate_df = df[df.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicate_df)

Duplicate Rows except first occurrence based on all columns are :
        Name  Company  Salary
6      Alisa    Apple    8500
11  Cathrine  Walmart    7700


**Duplicate Rows Based on Selected Columns**

Let’s find and select duplicate rows based on a single column.

In [118]:
duplicate_df = df[df.duplicated('Name')]
print(duplicate_df)

        Name    Company  Salary
6      Alisa      Apple    8500
7      Bobby  Cognizant    6300
9      Alisa      Apple    6200
11  Cathrine    Walmart    7700


Find and select duplicate rows based on two columns.

In [119]:
duplicate_df = df[df.duplicated(['Name', 'Company'])]
print(duplicate_df)

        Name  Company  Salary
6      Alisa    Apple    8500
9      Alisa    Apple    6200
11  Cathrine  Walmart    7700


**Dropping the Duplicate Rows**

In [120]:
df.drop_duplicates()

Unnamed: 0,Name,Company,Salary
0,Alisa,Apple,8500
1,Bobby,Walmart,6300
2,jodha,Walmart,5500
3,jack,Intel,7400
4,raghu,Apple,3100
5,Cathrine,Walmart,7700
7,Bobby,Cognizant,6300
8,kumar,Apple,4200
9,Alisa,Apple,6200
10,Alex,Cognizant,8900
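
`drop_duplicates()` also accepts the `subset` and `keep` parameters, so duplicates can be judged on selected columns and either the first or the last occurrence retained. A small sketch on a hypothetical four-row frame `staff`:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alisa", "Bobby", "Alisa", "Alisa"],
    "Company": ["Apple", "Walmart", "Apple", "Apple"],
    "Salary": [8500, 6300, 8500, 6200],
})

full_row = staff.drop_duplicates()                                   # rows identical in every column
by_name_first = staff.drop_duplicates(subset=["Name"])               # keep the first row per name
by_name_last = staff.drop_duplicates(subset=["Name"], keep="last")   # keep the last row per name

print(full_row.index.tolist())
print(by_name_first.index.tolist())
print(by_name_last.index.tolist())
```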


In [121]:
#data.to_csv("new_data.csv")

#### .apply()

In [122]:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
# Apply a function to double the values in column 'A'
df['A'] = df['A'].apply(lambda x: x * 2)
print(df)

   A  B
0  2  4
1  4  5
2  6  6


<a id="Groupby"> </a>
### 5. Groupby in Pandas
`groupby` is one of the most powerful tools in pandas. It lets you split → apply → combine data.

You can think of it like this:

1. Split the data into groups based on some column.
2. Apply a function (mean, sum, count, median, custom function, etc.) to each group.
3. Combine the results back into a DataFrame or Series.
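
The three steps can be spelled out by hand and compared with the one-line `groupby` form. This is a sketch on a hypothetical frame `sales`, not the dataset used below.

```python
import pandas as pd

sales = pd.DataFrame({
    "Food_Product": ["Cakes", "Biscuits", "Cakes", "Biscuits"],
    "Sales": [5000, 8000, 6500, 1000],
})

# Split: iterate over one sub-frame per distinct key.
# Apply: sum the Sales column of each sub-frame.
manual = {}
for product, group in sales.groupby("Food_Product"):
    manual[product] = group["Sales"].sum()

# Combine: collect the per-group results into one Series.
manual_result = pd.Series(manual, name="Sales")

# groupby performs all three steps in a single expression.
direct_result = sales.groupby("Food_Product")["Sales"].sum()

print(manual_result)
print(direct_result)
```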

In [127]:
# create a dataframe
my_df1=pd.DataFrame({
 'Product_ID':[101,102,103,104,105,106],
 'Food_Product':['Cakes','Biscuits','Fruit','Beverages','Cakes','Beverages'],
 'Brand':['Baskin Robbins','Blue Riband','Peach','Horlicks','Mars Muffin','Mirinda'],
 'Sales': [5000, 8000, 7600, 5500, 6500, 9000],
 'Profit': [55000, 67000, 89000, 78000, 55000, 90000]   
})
print(my_df1)

   Product_ID Food_Product           Brand  Sales  Profit
0         101        Cakes  Baskin Robbins   5000   55000
1         102     Biscuits     Blue Riband   8000   67000
2         103        Fruit           Peach   7600   89000
3         104    Beverages        Horlicks   5500   78000
4         105        Cakes     Mars Muffin   6500   55000
5         106    Beverages         Mirinda   9000   90000


In [128]:
my_df1

Unnamed: 0,Product_ID,Food_Product,Brand,Sales,Profit
0,101,Cakes,Baskin Robbins,5000,55000
1,102,Biscuits,Blue Riband,8000,67000
2,103,Fruit,Peach,7600,89000
3,104,Beverages,Horlicks,5500,78000
4,105,Cakes,Mars Muffin,6500,55000
5,106,Beverages,Mirinda,9000,90000


**Total sales of each food product**

Summing after a groupby returns a Series indexed by the group key. Turn it into a regular dataframe by calling `.to_frame()`, move the group key back into a column with `reset_index()`, and then call `sort_values()` as you would for any dataframe.

In [137]:
my_df1.groupby('Food_Product')['Sales'].sum()#.to_frame().reset_index()

Food_Product
Beverages    14500
Biscuits      8000
Cakes        11500
Fruit         7600
Name: Sales, dtype: int64

In [139]:
my_df1.groupby('Food_Product')['Sales'].sum().to_frame().reset_index().sort_values(by='Sales')

Unnamed: 0,Food_Product,Sales
3,Fruit,7600
1,Biscuits,8000
2,Cakes,11500
0,Beverages,14500
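
As an alternative to chaining `.to_frame().reset_index()`, `groupby` accepts `as_index=False`, which keeps the group key as an ordinary column. A sketch comparing both routes on a hypothetical frame `sales`:

```python
import pandas as pd

sales = pd.DataFrame({
    "Food_Product": ["Cakes", "Biscuits", "Cakes", "Fruit"],
    "Sales": [5000, 8000, 6500, 7600],
})

# Route used above: aggregate to a Series, then rebuild a dataframe.
long_form = (
    sales.groupby("Food_Product")["Sales"].sum()
    .to_frame()
    .reset_index()
    .sort_values(by="Sales")
)

# Shorter route: as_index=False keeps the group key as a regular column.
short_form = (
    sales.groupby("Food_Product", as_index=False)["Sales"].sum()
    .sort_values(by="Sales")
)

print(long_form)
print(short_form)
```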


**Hierarchical Indices Created by Groupby**

In [144]:
my_df1.groupby('Food_Product').agg({'Sales':['min','max','mean']})#.reset_index()

              Sales
                min   max    mean
Food_Product
Beverages      5500  9000  7250.0
Biscuits       8000  8000  8000.0
Cakes          5000  6500  5750.0
Fruit          7600  7600  7600.0


In [146]:
#my_df1.groupby('Food_Product')['Sales'].agg(['min','max','mean'])#.reset_index()

In [147]:
# flatten the hierarchical column index into single-level names
result = my_df1.groupby('Food_Product').agg({'Sales': ['min', 'max', 'mean']}).reset_index()
result.columns = ['Food_Product', 'Min_Sales', 'Max_Sales', 'Mean_Sales']
print(result)

  Food_Product  Min_Sales  Max_Sales  Mean_Sales
0    Beverages       5500       9000      7250.0
1     Biscuits       8000       8000      8000.0
2        Cakes       5000       6500      5750.0
3        Fruit       7600       7600      7600.0


In [148]:
result = my_df1.groupby('Food_Product').agg(
    Min_Sales=('Sales', 'min'),
    Max_Sales=('Sales', 'max'),
    Avg_Sales=('Sales', 'mean')
).reset_index()

In [149]:
result

Unnamed: 0,Food_Product,Min_Sales,Max_Sales,Avg_Sales
0,Beverages,5500,9000,7250.0
1,Biscuits,8000,8000,8000.0
2,Cakes,5000,6500,5750.0
3,Fruit,7600,7600,7600.0
