# Introduction To Pandas 🐼

- Pandas is a popular Python library used for data manipulation and analysis. It makes working with structured data (like tables, CSVs, Excel files, SQL data) very easy.

- It builds on the speed and power of NumPy to make data analysis and preprocessing easier.

## 1. Data Structures

- Series: 1D labeled array (like a single column).

- DataFrame: 2D labeled table (rows × columns, like a spreadsheet).

## 2. Data Operations

- Reading/writing data: CSV, Excel, SQL, JSON.

- Selecting, filtering, and slicing data.

- Handling missing data (NaN) with methods like dropna() and fillna().

- Aggregations: sum, mean, count, groupby operations.

- Sorting, merging, joining, and reshaping datasets.
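As a rough illustration of the aggregation and sorting operations listed above, the sketch below uses a small made-up table (the `sales` DataFrame and its column names are assumptions for illustration, not data from this notebook).

```python
import pandas as pd

# Hypothetical data, used only to illustrate groupby and sorting.
sales = pd.DataFrame({
    "city": ["paris", "london", "paris", "london"],
    "amount": [10, 20, 30, 40],
})

# Aggregation: total amount per city.
totals = sales.groupby("city")["amount"].sum()

# Sorting: rows ordered by amount, largest first.
ranked = sales.sort_values("amount", ascending=False)

print(totals)
print(ranked)
```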

In [4]:
import numpy as np
import pandas as pd

## 1. DataFrame (df)

- Think of a DataFrame as an Excel sheet: a table with rows and named columns

In [5]:
dict1 = {
    "name" : ["sidra", "harry", "shubh","skillf"],
    "marks" : [92, 43, 24, 17],
    "city" : ["amsterdam", "london", "paris", "japan"]
}

#a DataFrame is similar to an Excel sheet

df = pd.DataFrame(dict1)

In [6]:
df

Unnamed: 0,name,marks,city
0,sidra,92,amsterdam
1,harry,43,london
2,shubh,24,paris
3,skillf,17,japan


## 2. Writing Data To A CSV File?

- For this, we use the df.to_csv() method

- It writes the DataFrame to a CSV file

- Later we can read the file back to manipulate and analyze the data

## 3. Removing Indices From Data?

- For this, we pass index=False to to_csv, which leaves the row index (the numbering) out of the written file

In [7]:
#we can also remove indices using index = False

df.to_csv("friends.csv", index=False)
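To see what index=False changes, the sketch below compares the text that to_csv produces with and without the index. It uses a small stand-in DataFrame and calls to_csv with no path, which returns the CSV as a string instead of writing a file.

```python
import pandas as pd

# Small stand-in for the notebook's df.
df = pd.DataFrame({"name": ["sidra", "harry"], "marks": [92, 43]})

# With no path argument, to_csv returns the CSV text.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The header line shows the difference: an empty first field for the index.
print(with_index.splitlines()[0])     # ,name,marks
print(without_index.splitlines()[0])  # name,marks
```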

## 4. Displaying First Number Of Rows?

- For this, we use df.head(n), where n is the number of rows we want to see (5 if we leave it out)

In [8]:
#we can also see the first number of rows using df.head()

df.head(2)


Unnamed: 0,name,marks,city
0,sidra,92,amsterdam
1,harry,43,london


## 5. Displaying Last Number Of Rows?

- We can display the last n rows using df.tail(n)

In [9]:
#we can also see the last number of rows using df.tail()

df.tail(2)

Unnamed: 0,name,marks,city
2,shubh,24,paris
3,skillf,17,japan


## 6. Finding Numerical Stats?

- For this, we use df.describe() to get summary statistics of the numerical columns

In [10]:
#we can also check all numerical statistics using df.describe()

df.describe()

Unnamed: 0,marks
count,4.0
mean,44.0
std,33.832923
min,17.0
25%,22.25
50%,33.5
75%,55.25
max,92.0
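The std and percentile rows can be checked by hand. The sketch below recomputes them with NumPy for the same four marks, on the assumption that describe() uses the sample standard deviation (ddof=1) and linearly interpolated percentiles, which matches the numbers shown above.

```python
import numpy as np
import pandas as pd

marks = pd.Series([92, 43, 24, 17])
stats = marks.describe()

# Sample standard deviation: divide by n - 1 rather than n.
sample_std = np.std(marks.to_numpy(), ddof=1)

# 25th percentile with NumPy's default linear interpolation.
q25 = np.percentile(marks.to_numpy(), 25)

print(stats["std"], sample_std)
print(stats["25%"], q25)
```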


## 7. Reading A CSV File?

- For this, we use pd.read_csv() with the path of the file

In [11]:
sidra = pd.read_csv("sidra.csv")

In [12]:
sidra

Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


## 8. Accessing A Column?

- We index the variable that holds the DataFrame (here sidra) with the column name in square brackets


In [13]:
sidra['speed']
sidra["city"]

0    amsterdam
1       london
2        paris
3        japan
Name: city, dtype: object

## 9. Accessing A Value In A Column?

- For this, we select the column as above and then add the row's index label in a second pair of brackets

In [14]:
sidra['speed'][2]

np.int64(100)
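A single value can also be read in one step with .loc or .at, which avoids chaining two lookups. This is a minimal sketch on a stand-in table shaped like the one above; the variable name `trains` is an assumption for illustration.

```python
import pandas as pd

# Stand-in for the sidra DataFrame, keeping only two columns.
trains = pd.DataFrame({
    "speed": [92, 43, 100, 17],
    "city": ["amsterdam", "london", "paris", "japan"],
})

# Row label first, then column name, in a single indexing step.
value_loc = trains.loc[2, "speed"]

# .at is the scalar-only variant of .loc.
value_at = trains.at[2, "speed"]

print(value_loc, value_at)
```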

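## 10. Changing A Value In The CSV File

- We use .loc to set the value in a single step, which prevents the chained-assignment warning (the one that points to the caveats in the documentation)

- The file itself only changes once we write the DataFrame back with to_csv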

In [15]:
sidra.loc[2,'speed'] = 100


In [16]:
sidra

Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


In [17]:
sidra.to_csv("sidra.csv", index=False)

In [18]:
sidra

Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


## 11. Modifying Indices

- We can replace the row labels by assigning a new list of labels to .index

In [19]:
sidra.index = ["first", "second", "third", "fourth"]

In [20]:
sidra

Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
first,first,first,first,first,first,first,first,1238480,92,amsterdam
second,second,second,second,second,second,second,second,3213234,43,london
third,third,third,third,third,third,third,third,8094380,100,paris
fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


In [21]:
sidra.to_csv("sidra.csv")
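The Unnamed columns in the outputs above are most likely a result of saving with the index and reading the file back: the index is written as a column with an empty header, and read_csv names such a column "Unnamed: 0". The sketch below reproduces one round trip in memory with io.StringIO (a stand-in for the file) and shows index_col=0 as one way to read the labels back as the index.

```python
import io
import pandas as pd

df = pd.DataFrame({"speed": [92, 43]}, index=["first", "second"])

# The index is written as the first column, with an empty header field.
csv_text = df.to_csv()

# Reading it back turns that header-less column into "Unnamed: 0".
reread = pd.read_csv(io.StringIO(csv_text))
print(list(reread.columns))    # ['Unnamed: 0', 'speed']

# index_col=0 treats the first column as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))  # ['speed']
```

Each further save with the index followed by a plain read_csv adds another such column, so writing with index=False or reading with index_col=0 keeps the file stable.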

## 1. Understanding Series

- A Series can be a single column or a single row of a DataFrame

In [22]:
ser = pd.Series(np.random.rand(10))

In [23]:
ser

0    0.317151
1    0.191680
2    0.789994
3    0.894321
4    0.248413
5    0.627324
6    0.097628
7    0.974685
8    0.109630
9    0.977978
dtype: float64

In [24]:
type(ser)

pandas.core.series.Series
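To illustrate that both a column and a row come back as a Series, here is a short sketch on a small stand-in DataFrame. A column Series is labeled by the row index, while a row Series is labeled by the column names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["sidra", "harry"], "marks": [92, 43]})

column = df["marks"]  # one column, labeled by the row index
row = df.loc[0]       # one row, labeled by the column names

print(type(column), list(column.index))
print(type(row), list(row.index))
```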

## 2. Understanding DataFrame:

- A DataFrame is made up of multiple Series, one per column

In [25]:
newdf = pd.DataFrame(np.random.rand(334,5), index=np.arange(334))

In [26]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.920558,0.331947,0.659455,0.121918,0.683760
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.984730,0.492155,0.042735,0.574345
...,...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048,0.390958
330,0.411239,0.004028,0.571285,0.718737,0.693596
331,0.610824,0.688725,0.223865,0.477368,0.494092
332,0.087642,0.439532,0.527287,0.743251,0.319952


In [27]:
type(newdf)

pandas.core.frame.DataFrame

In [28]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [29]:
newdf[0][0] = 0.3

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 0.3


In [30]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.300000,0.331947,0.659455,0.121918,0.683760
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.984730,0.492155,0.042735,0.574345
...,...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048,0.390958
330,0.411239,0.004028,0.571285,0.718737,0.693596
331,0.610824,0.688725,0.223865,0.477368,0.494092
332,0.087642,0.439532,0.527287,0.743251,0.319952


In [31]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
      dtype='int64', length=334)

In [32]:
newdf.to_numpy()

array([[0.3       , 0.33194746, 0.65945494, 0.12191767, 0.68375966],
       [0.38871344, 0.35235432, 0.68492225, 0.96419565, 0.52800608],
       [0.84554496, 0.19059242, 0.61137331, 0.90532175, 0.84937156],
       ...,
       [0.61082447, 0.68872508, 0.22386517, 0.47736842, 0.49409167],
       [0.08764229, 0.43953162, 0.52728656, 0.74325061, 0.31995189],
       [0.52514024, 0.22889994, 0.37872244, 0.17328754, 0.42428074]])

In [33]:
newdf.sort_index(axis=1, ascending=False)

Unnamed: 0,4,3,2,1,0
0,0.683760,0.121918,0.659455,0.331947,0.300000
1,0.528006,0.964196,0.684922,0.352354,0.388713
2,0.849372,0.905322,0.611373,0.190592,0.845545
3,0.158648,0.866423,0.981091,0.389242,0.343303
4,0.574345,0.042735,0.492155,0.984730,0.783519
...,...,...,...,...,...
329,0.390958,0.297048,0.751780,0.717539,0.965079
330,0.693596,0.718737,0.571285,0.004028,0.411239
331,0.494092,0.477368,0.223865,0.688725,0.610824
332,0.319952,0.743251,0.527287,0.439532,0.087642


In [34]:
type(newdf[0])

pandas.core.series.Series

## View Behaviour Of DataFrames

- Any change made through newdf also shows up in newdf2, since newdf2 = newdf does not copy anything; both names refer to the same DataFrame

In [35]:
newdf2 = newdf

In [36]:
newdf[0][0] = 95794

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 95794


In [37]:
newdf

Unnamed: 0,0,1,2,3,4
0,95794.000000,0.331947,0.659455,0.121918,0.683760
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.984730,0.492155,0.042735,0.574345
...,...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048,0.390958
330,0.411239,0.004028,0.571285,0.718737,0.693596
331,0.610824,0.688725,0.223865,0.477368,0.494092
332,0.087642,0.439532,0.527287,0.743251,0.319952


## Making An Independent Copy With .copy()

- We can call .copy() on the original DataFrame so that changes made to the copy are not applied to the original

- Here newdf remains the same, but newdf2 does not, since newdf2 was created with newdf.copy()

In [38]:
newdf2 = newdf.copy()

In [39]:
newdf2[0][0] = 59870

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf2[0][0] = 59870


In [40]:
newdf2

Unnamed: 0,0,1,2,3,4
0,59870.000000,0.331947,0.659455,0.121918,0.683760
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.984730,0.492155,0.042735,0.574345
...,...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048,0.390958
330,0.411239,0.004028,0.571285,0.718737,0.693596
331,0.610824,0.688725,0.223865,0.477368,0.494092
332,0.087642,0.439532,0.527287,0.743251,0.319952


In [41]:
newdf

Unnamed: 0,0,1,2,3,4
0,95794.000000,0.331947,0.659455,0.121918,0.683760
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.984730,0.492155,0.042735,0.574345
...,...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048,0.390958
330,0.411239,0.004028,0.571285,0.718737,0.693596
331,0.610824,0.688725,0.223865,0.477368,0.494092
332,0.087642,0.439532,0.527287,0.743251,0.319952


## Avoiding Copy Warning

- We can use .loc[] to avoid the chained-assignment warning

### 1. .loc:

- It accesses rows and columns by their labels (index labels and column names); for newdf the default labels happen to be the integers starting at 0

In [42]:
newdf.loc[0,0] = 654
newdf.head(2)

Unnamed: 0,0,1,2,3,4
0,654.0,0.331947,0.659455,0.121918,0.68376
1,0.388713,0.352354,0.684922,0.964196,0.528006


In [43]:
newdf.columns = list("ABCDE")
newdf

Unnamed: 0,A,B,C,D,E
0,654.000000,0.331947,0.659455,0.121918,0.683760
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.984730,0.492155,0.042735,0.574345
...,...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048,0.390958
330,0.411239,0.004028,0.571285,0.718737,0.693596
331,0.610824,0.688725,0.223865,0.477368,0.494092
332,0.087642,0.439532,0.527287,0.743251,0.319952


In [44]:
newdf.loc[0,"A"] = 65445
newdf.head()

Unnamed: 0,A,B,C,D,E
0,65445.0,0.331947,0.659455,0.121918,0.68376
1,0.388713,0.352354,0.684922,0.964196,0.528006
2,0.845545,0.190592,0.611373,0.905322,0.849372
3,0.343303,0.389242,0.981091,0.866423,0.158648
4,0.783519,0.98473,0.492155,0.042735,0.574345


In [45]:
newdf.drop("E", axis=1)

Unnamed: 0,A,B,C,D
0,65445.000000,0.331947,0.659455,0.121918
1,0.388713,0.352354,0.684922,0.964196
2,0.845545,0.190592,0.611373,0.905322
3,0.343303,0.389242,0.981091,0.866423
4,0.783519,0.984730,0.492155,0.042735
...,...,...,...,...
329,0.965079,0.717539,0.751780,0.297048
330,0.411239,0.004028,0.571285,0.718737
331,0.610824,0.688725,0.223865,0.477368
332,0.087642,0.439532,0.527287,0.743251


In [46]:
newdf.loc[[1,2], ["C", "D"]]

Unnamed: 0,C,D
1,0.684922,0.964196
2,0.611373,0.905322


## Running Complex Query

- Selecting the rows where column A is smaller than 0.3 and column C is bigger than 0.1

In [47]:
newdf.loc[(newdf["A"]<0.3) & (newdf['C']>0.1 )]

Unnamed: 0,A,B,C,D,E
8,0.219102,0.350646,0.133662,0.706541,0.235013
9,0.250357,0.804715,0.194649,0.043584,0.582605
16,0.218157,0.164105,0.469095,0.608818,0.268392
20,0.243731,0.446674,0.774770,0.620787,0.131009
21,0.226523,0.468909,0.908942,0.504081,0.714795
...,...,...,...,...,...
307,0.149316,0.704958,0.914961,0.069316,0.729390
310,0.278223,0.881057,0.505287,0.421081,0.067906
316,0.076654,0.781046,0.797678,0.901081,0.896373
326,0.115180,0.632967,0.530314,0.541006,0.507931


### 2. .iloc:

- .iloc selects by integer position (row number, column number) instead of by label

In [48]:
newdf.head(2)


Unnamed: 0,A,B,C,D,E
0,65445.0,0.331947,0.659455,0.121918,0.68376
1,0.388713,0.352354,0.684922,0.964196,0.528006


In [49]:
#positions start from 0, so the five columns are numbered 0 to 4

newdf.iloc[0,3]

np.float64(0.12191767360456962)

In [50]:
newdf.iloc[[0,5], [1,2]]

Unnamed: 0,B,C
0,0.331947,0.659455
5,0.905248,0.069928
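With the default integer index, .loc and .iloc look alike, so the difference is easier to see on a DataFrame with string labels. A minimal sketch with made-up data follows; note that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]}, index=["x", "y", "z"])

by_label = df.loc["y", "B"]   # row labeled "y", column named "B"
by_position = df.iloc[1, 1]   # second row, second column

label_slice = df.loc["x":"y"]  # rows x and y (end label included)
pos_slice = df.iloc[0:1]       # only the first row (end position excluded)

print(by_label, by_position)
print(len(label_slice), len(pos_slice))
```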


## Using inplace=True To Modify Original Data

In [51]:
newdf.drop(["A", "D"], axis=1, inplace=True)
newdf

Unnamed: 0,B,C,E
0,0.331947,0.659455,0.683760
1,0.352354,0.684922,0.528006
2,0.190592,0.611373,0.849372
3,0.389242,0.981091,0.158648
4,0.984730,0.492155,0.574345
...,...,...,...
329,0.717539,0.751780,0.390958
330,0.004028,0.571285,0.693596
331,0.688725,0.223865,0.494092
332,0.439532,0.527287,0.319952
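Without inplace=True, drop returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept. The sketch below shows this on a small stand-in DataFrame; reassigning the result is a common alternative to inplace=True.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop returns a new DataFrame; frame itself still has column A.
dropped = frame.drop("A", axis=1)
columns_after_plain_drop = list(frame.columns)

# Reassigning keeps the change without using inplace=True.
frame = frame.drop(columns=["B"])

print(list(dropped.columns))     # ['B', 'C']
print(columns_after_plain_drop)  # ['A', 'B', 'C']
print(list(frame.columns))       # ['A', 'C']
```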


## Using .reset_index To Reset The Index

- reset_index restarts the index from 0, but it also keeps the old index as a new column called index

- To discard the old index instead, we pass drop=True

In [53]:
newdf.head(3)
newdf.reset_index(drop=True)

Unnamed: 0,B,C,E
0,0.331947,0.659455,0.683760
1,0.352354,0.684922,0.528006
2,0.190592,0.611373,0.849372
3,0.389242,0.981091,0.158648
4,0.984730,0.492155,0.574345
...,...,...,...
329,0.717539,0.751780,0.390958
330,0.004028,0.571285,0.693596
331,0.688725,0.223865,0.494092
332,0.439532,0.527287,0.319952
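The effect of reset_index is easiest to see after filtering, when the remaining index has gaps. This sketch uses a small made-up DataFrame and compares the result with and without drop=True.

```python
import pandas as pd

frame = pd.DataFrame({"B": [0.3, 0.9, 0.1, 0.7]})

# Filtering keeps the original labels, so the index is now 1, 3.
filtered = frame[frame["B"] > 0.5]

kept = filtered.reset_index()              # old labels move to an "index" column
dropped = filtered.reset_index(drop=True)  # old labels are discarded

print(list(filtered.index))  # [1, 3]
print(list(kept.columns))    # ['index', 'B']
print(list(dropped.index))   # [0, 1]
```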


## Using df.dropna()

- It removes the rows (or, with axis=1, the columns) that contain missing values

## 1. Removing Entire Rows With Missing Values

In [None]:

import pandas as pd

df2 = pd.DataFrame({ "name": ["Ali", "Sara", None],
    "age": [20, None, 25]})

print("original:")
print(df2)


print("new:")
print(df2.dropna())

original:
   name   age
0   Ali  20.0
1  Sara   NaN
2  None  25.0
new:
  name   age
0  Ali  20.0


## 2. Drop Rows Only If All Values Are Missing


In [57]:
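#dropna returns a new DataFrame, so we print its result; no row here has all values missing, so nothing is dropped
print(df2.dropna(how="all"))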

   name   age
0   Ali  20.0
1  Sara   NaN
2  None  25.0


## 3. Drop Rows Where Age Is A Missing Value

In [60]:
df2 = df2.dropna(subset=["age"])
print(df2)

   name   age
0   Ali  20.0
2  None  25.0
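Instead of dropping rows, missing values can be filled with fillna, which the introduction mentions alongside dropna. The sketch below is one possible approach on the same kind of data; the replacement choices (the text "unknown" and the mean age) are assumptions for illustration.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Ali", "Sara", None], "age": [20, None, 25]})

# A dict maps each column to the value used to fill its missing entries.
filled = people.fillna({"name": "unknown", "age": people["age"].mean()})

print(filled)
```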


## Removing Duplicate Values

## 1. Drop Duplicates Normally

In [62]:
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Ali", "John"],
    "age": [20, 22, 20, 30]
})

print("original:")
print(df)


print("new:")
print(df.drop_duplicates())

original:
   name  age
0   Ali   20
1  Sara   22
2   Ali   20
3  John   30
new:
   name  age
0   Ali   20
1  Sara   22
3  John   30


## 2. Drop Duplicates Based On A Column

In [63]:
newdf = df.drop_duplicates(subset=["name"])
print(newdf)

   name  age
0   Ali   20
1  Sara   22
3  John   30


## 3. Drop Duplicates While Keeping The Last Occurrence

In [64]:
newdf1 = df.drop_duplicates(keep="last")
print(newdf1)

   name  age
1  Sara   22
2   Ali   20
3  John   30
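Two related calls may help when inspecting duplicates: duplicated() flags the repeated rows without removing them, and keep=False drops every row that has a duplicate. A short sketch on the same kind of data (the variable name `people` is made up for illustration):

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ali", "Sara", "Ali", "John"],
    "age": [20, 22, 20, 30],
})

# True for each row that repeats an earlier row.
flags = people.duplicated()

# keep=False removes all copies of a duplicated row, not just the extras.
no_repeats = people.drop_duplicates(keep=False)

print(list(flags))               # [False, False, True, False]
print(list(no_repeats["name"]))  # ['Sara', 'John']
```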


In [79]:
df2 = df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'rating': [pd.NaT, 4, 3.5, 15, 5]
})

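#only df.info() (which prints) and the last expression, df.isnull(), produce output in this cell
df.head()                                #first rows
df.dropna()                              #rows without any missing value
df.drop_duplicates(subset=["brand"])     #one row per brand
df.info()                                #column types and non-null counts
df['rating'].value_counts(dropna=True)   #how often each rating occurs
df.notnull()                             #True where a value is present
df.isnull()                              #True where a value is missing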

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   brand   5 non-null      object 
 1   style   0 non-null      float64
 2   rating  4 non-null      object 
dtypes: float64(1), object(2)
memory usage: 252.0+ bytes


Unnamed: 0,brand,style,rating
0,False,True,True
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
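A common follow-up to isnull() is to sum the True values per column, which gives the number of missing entries in each column. The sketch below uses a shortened, made-up version of the table above.

```python
import numpy as np
import pandas as pd

noodles = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie"],
    "style": [np.nan, np.nan, np.nan],
    "rating": [np.nan, 4.0, 3.5],
})

# True counts as 1 when summed, so this counts missing values per column.
missing_per_column = noodles.isnull().sum()

print(missing_per_column)
```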
