# Neural Nine's tutorial on Pandas
[Video here](https://www.youtube.com/watch?v=EhYC02PD_gc)

## Series
a series is a one-dimensional labelled array.  
we can use the **'index'** argument to customise how we label the elements (by default the labels run from 0 to n-1). we can give the same label to multiple elements, but then looking up that label returns all the matching values.  
we can also use these labels to select values from a series with series.loc['a']

In [34]:
import pandas as pd

values = [1,2,3,4,5]
series = pd.Series(values, index=['a', 'b', 'c', 'd', 'c'])

series.loc['c']

c    3
c    5
dtype: int64
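
a small sketch of the difference between a unique and a repeated label, reusing the same values as above. the variable names single and repeated are only for illustration.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'c'])

single = series.loc['a']       # label 'a' occurs once -> a single value
repeated = series.loc['c']     # label 'c' occurs twice -> a Series with both values

print(single)                  # 1
print(list(repeated))          # [3, 5]
print(series.index.is_unique)  # False, because 'c' is repeated
```

checking index.is_unique first is one way to know in advance whether a lookup can return more than one value.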

## Dataframes
a dataframe is a collection of series that share the same index, one series per column.  
we can build one from a dictionary where each key is a column name and its list holds the values of that column, one per row.  
a dataframe is like an excel sheet: each column has a name, followed by rows of values

In [35]:
df = pd.DataFrame({
    'name': ['shashi', 'ved', 'nirlaj'],
    'age': [89, 98, 78],
    'branch': ['qweds', 'qweds', 'cse']
    })
df

|  | name | age | branch |
|---|---|---|---|
| 0 | shashi | 89 | qweds |
| 1 | ved | 98 | qweds |
| 2 | nirlaj | 78 | cse |
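
to make the "collection of series" idea concrete, here is a minimal sketch that builds the same kind of dataframe from two separate series instead of lists. the names names and ages are made up for this example.

```python
import pandas as pd

# two independent series with the default 0..n-1 index
names = pd.Series(['shashi', 'ved', 'nirlaj'])
ages = pd.Series([89, 98, 78])

# each dictionary key becomes a column name, each series becomes that column
df = pd.DataFrame({'name': names, 'age': ages})

print(df.shape)                   # (3, 2): 3 rows, 2 columns
print(type(df['age']).__name__)   # Series
print(list(df['age']))            # [89, 98, 78]
```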


we can also customize our index.  
we can set a column such as 'name' as the index, and then locate a row by its name

In [36]:
df = df.set_index('name')
print(df)
df.loc['shashi']

        age branch
name              
shashi   89  qweds
ved      98  qweds
nirlaj   78    cse


age          89
branch    qweds
Name: shashi, dtype: object
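
a short sketch of a few other things .loc accepts once 'name' is the index: a row label plus a column label for a single cell, or a list of labels for several rows. this rebuilds the dataframe from above so it can run on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['shashi', 'ved', 'nirlaj'],
    'age': [89, 98, 78],
    'branch': ['qweds', 'qweds', 'cse'],
}).set_index('name')

row = df.loc['ved']                     # whole row as a Series
age = df.loc['ved', 'age']              # single cell: row label, then column label
subset = df.loc[['shashi', 'nirlaj']]   # list of labels -> a smaller dataframe

print(age)                  # 98
print(row['branch'])        # qweds
print(list(subset.index))   # ['shashi', 'nirlaj']
```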

operations between dataframes are aligned on the row index (and on the column names), i.e., values with the same index label are combined, regardless of their position

In [37]:
df1 = pd.DataFrame({
    'a': [2,3,5]
}, index=[1,2,3])
df2 = pd.DataFrame({
    'a': [5,6,8]
}, index=[3,2,1])

In [38]:
df1 + df2

|  | a |
|---|---|
| 1 | 10 |
| 2 | 9 |
| 3 | 10 |
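
a sketch of what happens when the two indices only partly overlap. df3 is a hypothetical frame made up for this example: labels present on only one side give NaN with +, and add(..., fill_value=0) is one way to treat the missing side as 0.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [2, 3, 5]}, index=[1, 2, 3])
df3 = pd.DataFrame({'a': [5, 6, 8]}, index=[2, 3, 4])   # hypothetical: shares only labels 2 and 3

total = df1 + df3                      # labels 1 and 4 exist on one side only -> NaN
filled = df1.add(df3, fill_value=0)    # missing side counted as 0 instead

print(list(total.index))               # [1, 2, 3, 4]
print(total['a'].isna().tolist())      # [True, False, False, True]
print(filled['a'].tolist())            # [2.0, 8.0, 11.0, 8.0]
```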


we can also reset the index, i.e., go back to the default 0 to n-1 index. the old index ('name' here) becomes a regular column again

In [39]:
df = df.reset_index()
df

|  | name | age | branch |
|---|---|---|---|
| 0 | shashi | 89 | qweds |
| 1 | ved | 98 | qweds |
| 2 | nirlaj | 78 | cse |
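
a small sketch comparing the default reset_index() with reset_index(drop=True), which throws the old index away instead of turning it back into a column. the frame is a cut-down copy of the one above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['shashi', 'ved', 'nirlaj'],
    'age': [89, 98, 78],
}).set_index('name')

kept = df.reset_index()               # 'name' comes back as a regular column
dropped = df.reset_index(drop=True)   # 'name' is discarded

print(list(kept.columns))     # ['name', 'age']
print(list(dropped.columns))  # ['age']
print(list(dropped.index))    # [0, 1, 2]
```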


### Import and export of dataframes
we can export a dataframe as a csv file and also import a csv as a dataframe.  
by default the index is written to the csv as an unnamed first column, so when importing we get it back as an extra column named 'Unnamed: 0'. to fix that we pass index_col=0, which tells pandas that the first column is the index and not a column of its own.  
or we can simply tell pandas not to save the index while exporting; the last (commented) line shows how

In [None]:
df.to_csv('first.csv')

csvRead = pd.read_csv('first.csv')
csvRead1 = pd.read_csv('first.csv', index_col=0)
print(csvRead)
print('\nvs\n')
print(csvRead1)

# df.to_csv('something.csv', index=False)

   Unnamed: 0    name  age branch
0           0  shashi   89  qweds
1           1     ved   98  qweds
2           2  nirlaj   78    cse

vs

     name  age branch
0  shashi   89  qweds
1     ved   98  qweds
2  nirlaj   78    cse
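
the same round trip can be sketched without touching the disk: to_csv() with no path returns the csv text, and io.StringIO lets read_csv treat that text like a file. this is only an in-memory illustration of the three cases described above.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['shashi', 'ved'], 'age': [89, 98]})

with_index = df.to_csv()                 # no path given -> returns the csv text
without_index = df.to_csv(index=False)   # skip the index column when exporting

a = pd.read_csv(io.StringIO(with_index))               # index comes back as 'Unnamed: 0'
b = pd.read_csv(io.StringIO(with_index), index_col=0)  # first column used as the index
c = pd.read_csv(io.StringIO(without_index))            # nothing extra to clean up

print(list(a.columns))   # ['Unnamed: 0', 'name', 'age']
print(list(b.columns))   # ['name', 'age']
print(list(c.columns))   # ['name', 'age']
```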


we can export dataframes to **JSON**, **CSV** or simply a python **dictionary**

In [51]:
df.to_json('first.json')
dfDict = df.to_dict()
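
a sketch of what these exports look like for a tiny frame. to_dict() defaults to one inner dictionary per column, and orient='records' gives one dictionary per row instead; to_json() with no path returns the JSON text rather than writing a file.

```python
import json
import pandas as pd

df = pd.DataFrame({'name': ['shashi', 'ved'], 'age': [89, 98]})

by_column = df.to_dict()                # default: {column: {index: value}}
by_row = df.to_dict(orient='records')   # one dictionary per row
parsed = json.loads(df.to_json())       # JSON text parsed back into python objects

print(by_column)      # {'name': {0: 'shashi', 1: 'ved'}, 'age': {0: 89, 1: 98}}
print(by_row)         # [{'name': 'shashi', 'age': 89}, {'name': 'ved', 'age': 98}]
print(parsed['age'])  # {'0': 89, '1': 98} (JSON keys are always strings)
```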

## Data exploration functions
first we need some data to work with; we can load the california housing dataset from the sklearn library

In [56]:
from sklearn.datasets import fetch_california_housing as cfh

cdf = cfh(as_frame=True).frame
cdf.head(10)  # dataframe.head(n) shows the first n rows, dataframe.tail(n) the last n

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| 5 | 4.0368 | 52.0 | 4.761658 | 1.103627 | 413.0 | 2.139896 | 37.85 | -122.25 | 2.697 |
| 6 | 3.6591 | 52.0 | 4.931907 | 0.951362 | 1094.0 | 2.128405 | 37.84 | -122.25 | 2.992 |
| 7 | 3.12 | 52.0 | 4.797527 | 1.061824 | 1157.0 | 1.788253 | 37.84 | -122.25 | 2.414 |
| 8 | 2.0804 | 42.0 | 4.294118 | 1.117647 | 1206.0 | 2.026891 | 37.84 | -122.26 | 2.267 |
| 9 | 3.6912 | 52.0 | 4.970588 | 0.990196 | 1551.0 | 2.172269 | 37.84 | -122.25 | 2.611 |


we can also use **dataframe.sample(n)** to load n random rows

In [57]:
cdf.sample(10)

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 7175 | 2.4375 | 47.0 | 4.896154 | 1.015385 | 1193.0 | 4.588462 | 34.05 | -118.19 | 1.229 |
| 14865 | 2.507 | 30.0 | 3.96375 | 1.0775 | 2126.0 | 2.6575 | 32.64 | -117.09 | 1.427 |
| 8010 | 4.8516 | 34.0 | 5.343434 | 1.078283 | 972.0 | 2.454545 | 33.86 | -118.12 | 2.136 |
| 19381 | 1.7188 | 45.0 | 5.008065 | 1.040323 | 257.0 | 2.072581 | 37.77 | -120.86 | 1.094 |
| 14033 | 2.6818 | 35.0 | 4.388013 | 1.037855 | 726.0 | 2.290221 | 32.75 | -117.14 | 1.594 |
| 16198 | 1.425 | 43.0 | 3.690909 | 1.018182 | 1805.0 | 4.102273 | 37.96 | -121.27 | 0.613 |
| 19747 | 3.7 | 29.0 | 5.883077 | 1.033846 | 859.0 | 2.643077 | 40.19 | -122.24 | 0.705 |
| 12977 | 4.0417 | 31.0 | 5.421842 | 1.025696 | 1396.0 | 2.989293 | 38.67 | -121.32 | 1.145 |
| 7702 | 4.4732 | 36.0 | 5.770149 | 1.01791 | 958.0 | 2.859701 | 33.96 | -118.13 | 2.66 |
| 3318 | 2.5326 | 20.0 | 6.29249 | 1.29249 | 647.0 | 2.557312 | 39.05 | -122.86 | 1.368 |
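
a small sketch of head, tail and sample on a made-up 100-row frame. sample picks different rows on every run, so passing random_state is one way to get the same rows each time.

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})   # hypothetical frame, just the numbers 0..99

first = df.head(3)                     # first 3 rows
last = df.tail(3)                      # last 3 rows
s1 = df.sample(5, random_state=0)      # 5 random rows, seeded
s2 = df.sample(5, random_state=0)      # same seed -> same rows

print(first['x'].tolist())   # [0, 1, 2]
print(last['x'].tolist())    # [97, 98, 99]
print(s1.equals(s2))         # True
```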


we can list all columns using df.columns

In [60]:
list(cdf.columns)

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude',
 'MedHouseVal']

sometimes a dataframe has a lot of columns and we would like to limit how many are shown. here's how we can do that

In [61]:
pd.options.display.max_columns = 5
cdf

|  | MedInc | HouseAge | ... | Longitude | MedHouseVal |
|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | ... | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | ... | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | ... | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | ... | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | ... | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | ... | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | ... | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | ... | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | ... | -121.32 | 0.847 |
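
setting pd.options.display.max_columns changes the display for the rest of the session. if we only want the limit for one block of code, a possible alternative is pd.option_context, which restores the old value when the block ends. the 12-column frame wide below is made up for this sketch.

```python
import pandas as pd

wide = pd.DataFrame({f'col{i}': [i] for i in range(12)})   # hypothetical 12-column frame

before = pd.get_option('display.max_columns')
with pd.option_context('display.max_columns', 5):
    inside = pd.get_option('display.max_columns')   # limit is active only in this block
    text = repr(wide)                                # middle columns are replaced by '...'
after = pd.get_option('display.max_columns')

print(inside)            # 5
print('...' in text)     # True
print(before == after)   # True: the old setting is back
```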


we can see info about any dataframe (column names, non-null counts, dtypes, memory usage) by using df.info()

In [63]:
cdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


we can get a quick overview of our data by using df.describe()

In [66]:
pd.options.display.max_columns = 10 #just to see all columns
cdf.describe()

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |
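
the result of describe() is itself a dataframe, so individual statistics can be picked out with .loc. a minimal sketch on a made-up four-value column, small enough to check by hand:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})   # hypothetical data
summary = df.describe()

print(list(summary.index))         # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc['count', 'x'])   # 4.0
print(summary.loc['mean', 'x'])    # 2.5
print(summary.loc['50%', 'x'])     # 2.5, the median
print(summary.loc['max', 'x'])     # 4.0
```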


### Getting series from a dataframe
we can get individual columns as series from our dataframe, here's how

In [None]:
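s1 = cdf.HouseAge     # attribute access, works only when the column name is a valid python identifier
# or
s2 = cdf['HouseAge']  # bracket access, works for any column name
s1, s2  # both are the same series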
print(type(s1), type(s2))

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
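
a sketch of two cases where the access style matters, on a made-up two-column frame: a column name with a space can only be reached with brackets, and passing a list of names returns a dataframe rather than a series. the column 'Med Inc' is hypothetical and has a space on purpose.

```python
import pandas as pd

df = pd.DataFrame({'HouseAge': [41.0, 21.0], 'Med Inc': [8.3, 8.3]})   # hypothetical mini frame

ages = df['HouseAge']      # single name -> Series
inc = df['Med Inc']        # name with a space: brackets are the only option
frame = df[['HouseAge']]   # list of names -> DataFrame with one column

print(type(ages).__name__)        # Series
print(type(frame).__name__)       # DataFrame
print(df.HouseAge.equals(ages))   # True
```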


and then we can call functions on this series, like s1.mean(), s1.max(), s1.min(), s1.mode(), etc.  
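
a quick sketch of these series functions on a few made-up values. note that mode() returns a series, because more than one value can be tied for most frequent.

```python
import pandas as pd

ages = pd.Series([41.0, 21.0, 52.0, 52.0, 52.0])   # hypothetical values

print(ages.mean())            # 43.6
print(ages.max())             # 52.0
print(ages.min())             # 21.0
print(ages.mode().tolist())   # [52.0]
```
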
some libraries return their data directly as dataframes, for example we can get stock data using the yfinance library

In [None]:
import yfinance as yf

apple_df = yf.download('AAPL')

[*********************100%***********************]  1 of 1 completed


In [77]:
print(type(apple_df))
apple_df

<class 'pandas.core.frame.DataFrame'>


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | AAPL | AAPL | AAPL | AAPL | AAPL |
| Date |  |  |  |  |  |
| 2025-08-11 | 227.179993 | 229.559998 | 224.759995 | 227.919998 | 61806100 |
| 2025-08-12 | 229.649994 | 230.800003 | 227.070007 | 228.009995 | 55626200 |
| 2025-08-13 | 233.330002 | 235.0 | 230.429993 | 231.070007 | 69878500 |
| 2025-08-14 | 232.779999 | 235.119995 | 230.850006 | 234.059998 | 51916300 |
| 2025-08-15 | 231.589996 | 234.279999 | 229.339996 | 234.0 | 56038700 |
| 2025-08-18 | 230.889999 | 233.119995 | 230.110001 | 231.699997 | 37476200 |
| 2025-08-19 | 230.559998 | 232.869995 | 229.350006 | 231.279999 | 39402600 |
| 2025-08-20 | 226.009995 | 230.470001 | 225.770004 | 229.979996 | 42263900 |
| 2025-08-21 | 224.899994 | 226.520004 | 223.779999 | 226.270004 | 30621200 |
| 2025-08-22 | 227.759995 | 229.089996 | 225.410004 | 226.169998 | 42477800 |
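
the output above has two header levels (Price and Ticker), i.e., the columns are a MultiIndex. a sketch of how such columns can be selected, using a small frame built by hand to mimic that layout (it is not produced by yfinance, and the values are rounded from the first two rows above):

```python
import pandas as pd

# hand-built stand-in for the two-level column layout shown above
columns = pd.MultiIndex.from_tuples(
    [('Close', 'AAPL'), ('Volume', 'AAPL')], names=['Price', 'Ticker']
)
prices = pd.DataFrame(
    [[227.18, 61806100], [229.65, 55626200]],
    index=pd.to_datetime(['2025-08-11', '2025-08-12']),
    columns=columns,
)
prices.index.name = 'Date'

close = prices['Close']                     # first level only -> dataframe, one column per ticker
aapl_close = prices[('Close', 'AAPL')]      # full (Price, Ticker) tuple -> a single series
flat = prices.droplevel('Ticker', axis=1)   # drop the ticker level when there is only one ticker

print(list(close.columns))   # ['AAPL']
print(aapl_close.iloc[0])    # 227.18
print(list(flat.columns))    # ['Close', 'Volume']
```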
