# Introduction To Pandas 🐼

- Pandas is a popular Python library for data manipulation and analysis. It makes working with structured data (tables, CSV files, Excel files, SQL query results) straightforward.

- It is built on top of NumPy and uses NumPy's fast array operations for data analysis and preprocessing.

## 1. Data Structures

- Series: 1D labeled array (like a single column).

- DataFrame: 2D labeled table (rows × columns, like a spreadsheet).
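
As a minimal sketch of the two structures (the names `marks` and `table` are illustrative and not part of this notebook), a Series can be built from a list plus labels, and any single column taken from a DataFrame is itself a Series:

```python
import pandas as pd

# Illustrative names and values, not taken from the notebook's files.
labels = ["sidra", "harry", "shubh"]

# Series: one-dimensional values, each attached to a label.
marks = pd.Series([92, 43, 24], index=labels, name="marks")

# DataFrame: a two-dimensional table whose columns share one index.
table = pd.DataFrame(
    {"marks": [92, 43, 24], "city": ["amsterdam", "london", "paris"]},
    index=labels,
)

print(marks.ndim, table.ndim)   # number of dimensions of each structure
print(table.shape)              # (rows, columns)
print(type(table["city"]))      # a single column is a Series
```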

## 2. Data Operations

- Reading/writing data: CSV, Excel, SQL, JSON.

- Selecting, filtering, and slicing data.

- Handling missing data (NaN) with methods like dropna() and fillna().

- Aggregations: sum, mean, count, groupby operations.

- Sorting, merging, joining, and reshaping datasets.
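
A few of the operations listed above can be sketched on a small made-up table. The `scores` DataFrame and its values are assumptions chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing mark.
scores = pd.DataFrame({
    "city": ["london", "paris", "london", "paris"],
    "marks": [40.0, np.nan, 60.0, 80.0],
})

dropped = scores.dropna()                    # drop rows that contain NaN
filled = scores.fillna({"marks": 0.0})       # replace NaN in "marks" with 0.0
mean_by_city = scores.groupby("city")["marks"].mean()    # NaN is skipped
ordered = scores.sort_values("marks", ascending=False)   # NaN rows go last

print(dropped)
print(filled)
print(mean_by_city)
print(ordered)
```

None of these calls modifies `scores`; each one returns a new object.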

In [325]:
import numpy as np
import pandas as pd

## 1. DataFrame (df)

- Think of a DataFrame as an Excel sheet: a table of rows and named columns

In [326]:
dict1 = {
    "name" : ["sidra", "harry", "shubh","skillf"],
    "marks" : [92, 43, 24, 17],
    "city" : ["amsterdam", "london", "paris", "japan"]
}

# a DataFrame is similar to an Excel sheet: each dict key becomes a column

df = pd.DataFrame(dict1)

In [327]:
df

Unnamed: 0,name,marks,city
0,sidra,92,amsterdam
1,harry,43,london
2,shubh,24,paris
3,skillf,17,japan


## 2. Writing Data To A CSV File

- For this, we use the DataFrame's to_csv() method

- It writes the contents of the DataFrame to a CSV file

- Later we can read the file back to manipulate and analyze the data

## 3. Removing Indices From Data

- Passing index=False to to_csv() leaves the row index (the 0, 1, 2, ... numbering) out of the file

In [328]:
# index=False leaves the row index out of the written file

df.to_csv("friends.csv", index=False)
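
To see what index=False changes without touching any file, one option is to call to_csv() with no path, in which case it returns the CSV text as a string. The `friends` DataFrame below is a shortened stand-in for the one above:

```python
import pandas as pd

# Shortened stand-in for the notebook's DataFrame.
friends = pd.DataFrame({"name": ["sidra", "harry"], "marks": [92, 43]})

with_index = friends.to_csv()                # no path: the CSV text is returned
without_index = friends.to_csv(index=False)  # same data, index column omitted

print(with_index)
print(without_index)
```

With the index included, the header row starts with an empty field for the unnamed index column.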

## 4. Displaying The First Rows

- For this, we use df.head(n), where n is the number of rows we want to see (5 if omitted)

In [329]:
# show the first 2 rows

df.head(2)


Unnamed: 0,name,marks,city
0,sidra,92,amsterdam
1,harry,43,london


## 5. Displaying The Last Rows

- We can display the last n rows using df.tail(n) (5 if omitted)

In [330]:
# show the last 2 rows

df.tail(2)

Unnamed: 0,name,marks,city
2,shubh,24,paris
3,skillf,17,japan
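
A short sketch of head() and tail() on a longer table, where the default row count becomes visible. The `numbers` DataFrame is made up for illustration:

```python
import pandas as pd

# Hypothetical ten-row table so the default of 5 rows is visible.
numbers = pd.DataFrame({"n": range(10)})

first_default = numbers.head()   # n defaults to 5
first_three = numbers.head(3)
last_two = numbers.tail(2)       # rows keep their original index labels

print(first_default)
print(first_three)
print(last_two)
```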


## 6. Finding Numerical Stats

- For this, we use df.describe(), which returns summary statistics (count, mean, std, min, quartiles, max) for the numerical columns

In [331]:
# summary statistics for the numerical columns

df.describe()

Unnamed: 0,marks
count,4.0
mean,44.0
std,33.832923
min,17.0
25%,22.25
50%,33.5
75%,55.25
max,92.0
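
By default describe() skips the text columns, which is why only marks appears above. As a hedged sketch, passing include="all" asks for a summary of every column. The DataFrame below repeats two of the notebook's columns:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["sidra", "harry", "shubh", "skillf"],
    "marks": [92, 43, 24, 17],
})

numeric_stats = df.describe()           # numerical columns only
all_stats = df.describe(include="all")  # text columns are summarised as well

print(numeric_stats)
print(all_stats)
```

Statistics that do not apply to a column (for example the mean of name) show up as NaN in the combined table.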


## 7. Reading A CSV File

- pd.read_csv() reads a CSV file into a DataFrame


In [332]:
sidra = pd.read_csv("sidra.csv")

In [333]:
sidra

Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


## 8. Accessing A Column

- We index the variable that holds the loaded file with the column name in square brackets


In [334]:
sidra['speed']
sidra["city"]

0    amsterdam
1       london
2        paris
3        japan
Name: city, dtype: object

## 9. Accessing A Value In A Column

- For this, we keep the column selection above and add the row's index label after it

In [335]:
sidra['speed'][2]

np.int64(100)
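
The same lookups can also be written in a single step with .loc or .at, which take the row label and the column name together. This is a sketch on a stand-in table (`trains` is a hypothetical name, with values copied from the output above):

```python
import pandas as pd

# Stand-in for the loaded file, limited to two columns.
trains = pd.DataFrame({
    "speed": [92, 43, 100, 17],
    "city": ["amsterdam", "london", "paris", "japan"],
})

speed_column = trains["speed"]            # one name -> Series
two_columns = trains[["speed", "city"]]   # list of names -> DataFrame
value_loc = trains.loc[2, "speed"]        # row label 2, column "speed"
value_at = trains.at[2, "speed"]          # .at is limited to single values

print(value_loc, value_at)
```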

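## 10. Changing A Value In The CSV File

- We assign through .loc[row, column] to prevent the chained-assignment (view versus copy) warning, then save the DataFrame back to the file with to_csv()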

In [336]:
sidra.loc[2,'speed'] = 100


In [337]:
sidra

Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


In [338]:
sidra.to_csv("sidra.csv", index=False)

In [339]:
sidra

Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


## 11. Modifying Indices

- We can replace the row labels by assigning a list of the same length to the DataFrame's index attribute

In [340]:
sidra.index = ["first", "second", "third", "fourth"]

In [341]:
sidra

Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
first,first,first,first,first,first,first,1238480,92,amsterdam
second,second,second,second,second,second,second,3213234,43,london
third,third,third,third,third,third,third,8094380,100,paris
fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


In [342]:
sidra.to_csv("sidra.csv")
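
Saving with the index and then reading the file back with a plain read_csv() turns the saved index into an extra data column named "Unnamed: 0", which is the likely source of the Unnamed columns in the output above. The sketch below reproduces this in a temporary directory and shows index_col=0 as one way to restore the index. The file name demo.csv is hypothetical:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"speed": [92, 43]}, index=["first", "second"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")   # hypothetical file name
    frame.to_csv(path)                     # index written as an unnamed first column

    plain = pd.read_csv(path)                  # index returns as a data column
    restored = pd.read_csv(path, index_col=0)  # first column becomes the index

print(list(plain.columns))
print(list(restored.columns))
print(list(restored.index))
```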

## 1. Understanding Series

- A Series is a one-dimensional labeled array; each column (or row) of a DataFrame is a Series

In [343]:
ser = pd.Series(np.random.rand(10))

In [344]:
ser

0    0.509354
1    0.759615
2    0.973738
3    0.563507
4    0.959282
5    0.010069
6    0.108513
7    0.761688
8    0.938729
9    0.029429
dtype: float64

In [345]:
type(ser)

pandas.core.series.Series
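
To illustrate the link between the two structures, the sketch below pulls one column and one row out of a small made-up DataFrame. Both come back as Series, labeled by the row index and by the column names respectively:

```python
import pandas as pd

# Small illustrative table.
df = pd.DataFrame({"marks": [92, 43], "city": ["amsterdam", "london"]})

column = df["marks"]   # a column: labeled by the row index
row = df.loc[0]        # a row: labeled by the column names

print(type(column), list(column.index))
print(type(row), list(row.index))
```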

## 2. Understanding DataFrame

- A DataFrame is made up of multiple Series (its columns) that share one index

In [346]:
newdf = pd.DataFrame(np.random.rand(334,5), index=np.arange(334))

In [347]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.701468,0.134824,0.458881,0.624114,0.440990
1,0.663910,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.117040,0.466164,0.479919,0.847380
4,0.493829,0.912096,0.847250,0.513068,0.691186
...,...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548,0.602156
330,0.976855,0.368757,0.309985,0.379406,0.048042
331,0.784319,0.694875,0.113592,0.892542,0.400633
332,0.831258,0.575920,0.283062,0.345489,0.936575


In [348]:
type(newdf)

pandas.core.frame.DataFrame

In [349]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [350]:
newdf[0][0] = 0.3

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 0.3


In [351]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.300000,0.134824,0.458881,0.624114,0.440990
1,0.663910,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.117040,0.466164,0.479919,0.847380
4,0.493829,0.912096,0.847250,0.513068,0.691186
...,...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548,0.602156
330,0.976855,0.368757,0.309985,0.379406,0.048042
331,0.784319,0.694875,0.113592,0.892542,0.400633
332,0.831258,0.575920,0.283062,0.345489,0.936575


In [352]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
      dtype='int64', length=334)

In [353]:
newdf.to_numpy()

array([[0.3       , 0.13482449, 0.45888102, 0.62411436, 0.44098977],
       [0.66390952, 0.29002501, 0.02958245, 0.90359296, 0.60275451],
       [0.32607643, 0.49111761, 0.57117335, 0.19211511, 0.79999947],
       ...,
       [0.78431883, 0.6948747 , 0.1135917 , 0.89254186, 0.40063288],
       [0.83125766, 0.57591973, 0.28306212, 0.3454893 , 0.93657536],
       [0.28792231, 0.91677249, 0.40246749, 0.64392873, 0.26191195]])

In [354]:
newdf.sort_index(axis=1, ascending=False)

Unnamed: 0,4,3,2,1,0
0,0.440990,0.624114,0.458881,0.134824,0.300000
1,0.602755,0.903593,0.029582,0.290025,0.663910
2,0.799999,0.192115,0.571173,0.491118,0.326076
3,0.847380,0.479919,0.466164,0.117040,0.956532
4,0.691186,0.513068,0.847250,0.912096,0.493829
...,...,...,...,...,...
329,0.602156,0.397548,0.885971,0.560529,0.304717
330,0.048042,0.379406,0.309985,0.368757,0.976855
331,0.400633,0.892542,0.113592,0.694875,0.784319
332,0.936575,0.345489,0.283062,0.575920,0.831258


In [355]:
type(newdf[0])

pandas.core.series.Series

## View Behaviour Of DataFrames

- newdf2 = newdf does not copy any data: both names refer to the same DataFrame object, so any change made through newdf is also visible through newdf2

In [356]:
newdf2 = newdf

In [357]:
newdf[0][0] = 95794

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 95794


In [358]:
newdf

Unnamed: 0,0,1,2,3,4
0,95794.000000,0.134824,0.458881,0.624114,0.440990
1,0.663910,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.117040,0.466164,0.479919,0.847380
4,0.493829,0.912096,0.847250,0.513068,0.691186
...,...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548,0.602156
330,0.976855,0.368757,0.309985,0.379406,0.048042
331,0.784319,0.694875,0.113592,0.892542,0.400633
332,0.831258,0.575920,0.283062,0.345489,0.936575
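
A compact way to check this behaviour, using hypothetical names `original` and `alias` and a single-step .loc assignment:

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0]})
alias = original               # plain assignment: a second name, no copy

alias.loc[0, "A"] = 99.0       # write through one name...

print(alias is original)       # both names point at one object
print(original.loc[0, "A"])    # ...and the change is visible through the other
```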


## Creating An Independent Copy With Dot Copy

- Calling .copy() on the original DataFrame creates an independent copy, so changes made to the copy are not applied to the original

- Here newdf remains the same, but newdf2 does not, since newdf2 was created with newdf.copy()

In [359]:
newdf2 = newdf.copy()

In [360]:
newdf2[0][0] = 59870

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf2[0][0] = 59870


In [361]:
newdf2

Unnamed: 0,0,1,2,3,4
0,59870.000000,0.134824,0.458881,0.624114,0.440990
1,0.663910,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.117040,0.466164,0.479919,0.847380
4,0.493829,0.912096,0.847250,0.513068,0.691186
...,...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548,0.602156
330,0.976855,0.368757,0.309985,0.379406,0.048042
331,0.784319,0.694875,0.113592,0.892542,0.400633
332,0.831258,0.575920,0.283062,0.345489,0.936575


In [362]:
newdf

Unnamed: 0,0,1,2,3,4
0,95794.000000,0.134824,0.458881,0.624114,0.440990
1,0.663910,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.117040,0.466164,0.479919,0.847380
4,0.493829,0.912096,0.847250,0.513068,0.691186
...,...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548,0.602156
330,0.976855,0.368757,0.309985,0.379406,0.048042
331,0.784319,0.694875,0.113592,0.892542,0.400633
332,0.831258,0.575920,0.283062,0.345489,0.936575
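
The same check in miniature, with hypothetical names `original` and `independent`:

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0]})
independent = original.copy()       # separate object with its own data

independent.loc[0, "A"] = -1.0      # only the copy is modified

print(independent is original)
print(original.loc[0, "A"], independent.loc[0, "A"])
```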


## Avoiding Copy Warning

- We can use .loc[] to avoid the copy warning, since it performs the assignment in a single step

### 1. .loc:

- It accesses rows and columns by their labels (index labels and column names); in newdf the row labels happen to be the integers 0 to 333

In [363]:
newdf.loc[0,0] = 654
newdf.head(2)

Unnamed: 0,0,1,2,3,4
0,654.0,0.134824,0.458881,0.624114,0.44099
1,0.66391,0.290025,0.029582,0.903593,0.602755


In [364]:
newdf.columns = list("ABCDE")
newdf

Unnamed: 0,A,B,C,D,E
0,654.000000,0.134824,0.458881,0.624114,0.440990
1,0.663910,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.117040,0.466164,0.479919,0.847380
4,0.493829,0.912096,0.847250,0.513068,0.691186
...,...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548,0.602156
330,0.976855,0.368757,0.309985,0.379406,0.048042
331,0.784319,0.694875,0.113592,0.892542,0.400633
332,0.831258,0.575920,0.283062,0.345489,0.936575


In [365]:
newdf.loc[0,"A"] = 65445
newdf.head()

Unnamed: 0,A,B,C,D,E
0,65445.0,0.134824,0.458881,0.624114,0.44099
1,0.66391,0.290025,0.029582,0.903593,0.602755
2,0.326076,0.491118,0.571173,0.192115,0.799999
3,0.956532,0.11704,0.466164,0.479919,0.84738
4,0.493829,0.912096,0.84725,0.513068,0.691186


In [366]:
newdf.drop("E", axis=1)

Unnamed: 0,A,B,C,D
0,65445.000000,0.134824,0.458881,0.624114
1,0.663910,0.290025,0.029582,0.903593
2,0.326076,0.491118,0.571173,0.192115
3,0.956532,0.117040,0.466164,0.479919
4,0.493829,0.912096,0.847250,0.513068
...,...,...,...,...
329,0.304717,0.560529,0.885971,0.397548
330,0.976855,0.368757,0.309985,0.379406
331,0.784319,0.694875,0.113592,0.892542
332,0.831258,0.575920,0.283062,0.345489


In [367]:
newdf.loc[[1,2], ["C", "D"]]

Unnamed: 0,C,D
1,0.029582,0.903593
2,0.571173,0.192115
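
Because newdf has integer row labels, it is easy to miss that .loc works on labels and not positions. The sketch below uses a made-up DataFrame with string labels to make that visible:

```python
import pandas as pd

# Illustrative table with string row labels.
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]}, index=["x", "y", "z"])

single = df.loc["y", "A"]             # one cell by row label and column name
block = df.loc["x":"y", ["A", "B"]]   # label slices include both endpoints
df.loc["z", "B"] = 300                # single-step assignment, no warning

print(single)
print(block)
print(df)
```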


## Running Complex Query

- Selecting the rows where column A is smaller than 0.3 and column C is bigger than 0.1

In [368]:
newdf.loc[(newdf["A"]<0.3) & (newdf['C']>0.1 )]

Unnamed: 0,A,B,C,D,E
7,0.034320,0.207290,0.668619,0.061213,0.980144
8,0.198457,0.161529,0.164404,0.222778,0.771709
13,0.066803,0.117219,0.354150,0.575906,0.560246
18,0.284251,0.366279,0.581135,0.312927,0.927830
19,0.083416,0.559270,0.447190,0.215360,0.604678
...,...,...,...,...,...
323,0.038469,0.276550,0.791567,0.117943,0.187242
324,0.173519,0.647553,0.622977,0.216961,0.726266
326,0.146753,0.000023,0.593809,0.192594,0.090468
327,0.121221,0.368060,0.478814,0.149420,0.973764


### 2. .iloc:

- .iloc selects rows and columns by integer position (starting at 0) rather than by label

In [371]:
newdf.head(2)


Unnamed: 0,A,B,C,D,E
0,65445.0,0.134824,0.458881,0.624114,0.44099
1,0.66391,0.290025,0.029582,0.903593,0.602755


In [None]:
# positions start at 0, so the five columns are numbered 0 to 4

newdf.iloc[0,3]

np.float64(0.6241143570077322)

In [375]:
newdf.iloc[[0,5], [1,2]]

Unnamed: 0,B,C
0,0.134824,0.458881
5,0.689493,0.995642
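
A side-by-side sketch of .iloc and .loc on a made-up DataFrame with string labels, where position and label can no longer be confused:

```python
import pandas as pd

# Illustrative table with string row labels.
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]}, index=["x", "y", "z"])

by_position = df.iloc[0, 1]   # first row, second column
by_label = df.loc["x", "B"]   # the same cell, addressed by labels
first_two = df.iloc[0:2]      # positional slices exclude the end position
last_cell = df.iloc[-1, 0]    # negative positions count from the end

print(by_position, by_label, last_cell)
print(first_two)
```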
