## DataFrames

A DataFrame can be thought of as a collection of Series objects that share the same index.
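As a minimal sketch of that idea, the snippet below builds a DataFrame from two Series that share an index. The names `idx`, `s1`, `s2` and `frame` and the values are made up for illustration.

```python
import pandas as pd

# Two Series with the same index labels (hypothetical example data)
idx = ["a", "b", "c"]
s1 = pd.Series([1, 2, 3], index=idx)
s2 = pd.Series([4.0, 5.0, 6.0], index=idx)

# Each dict key becomes a column name; the shared index becomes the row index
frame = pd.DataFrame({"x": s1, "y": s2})

print(frame.index.tolist())   # row labels shared by every column
print(type(frame["x"]))       # each column is itself a Series
```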

In [1]:
import pandas as pd
import numpy as np

In [44]:
df = pd.DataFrame([[10,20,30],[50,60,70], [20,30,40]],columns=['Col1','Col2', 'Col3'])
df

Unnamed: 0,Col1,Col2,Col3
0,10,20,30
1,50,60,70
2,20,30,40


In [46]:
df.columns  # to get the column names

Index(['Col1', 'Col2', 'Col3'], dtype='object')

In [93]:
df.dtypes

Col1    int64
Col2    int64
Col3    int64
dtype: object

In [49]:
df.index = ["R1","R2","R3"]
df

Unnamed: 0,Col1,Col2,Col3
R1,10,20,30
R2,50,60,70
R3,20,30,40


In [52]:
df1 = pd.DataFrame(np.random.randn(4,5),index=['R1','R2', 'R3', 'R4'],columns=['C1', 'C2', 'C3', 'C4','C5'])

In [53]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,-1.467514,-0.494095,-0.162535,0.485809,0.392489
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235
R3,-0.568581,1.407338,0.641806,-0.9051,-0.391157
R4,1.028293,-1.972605,-0.866885,0.720788,-1.223082


In [91]:
df_dict = pd.DataFrame({"Brand":['Samsung','Realme','Mi','Nokia'], 
                 "Camera":[48,24,16,32], "RAM":[6,4,3,4], "Price" : [17500, 12000, 6999, 22000]})
df_dict

Unnamed: 0,Brand,Camera,RAM,Price
0,Samsung,48,6,17500
1,Realme,24,4,12000
2,Mi,16,3,6999
3,Nokia,32,4,22000


### Selection and Indexing


In [54]:
df1['C2']

R1   -0.494095
R2   -0.855196
R3    1.407338
R4   -1.972605
Name: C2, dtype: float64

In [55]:
# Pass a list of column names
df1[['C2','C4']]

Unnamed: 0,C2,C4
R1,-0.494095,0.485809
R2,-0.855196,0.666319
R3,1.407338,-0.9051
R4,-1.972605,0.720788


In [7]:
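# Attribute-style access, similar to SQL dot notation (NOT RECOMMENDED: it breaks when a column name clashes with a DataFrame method or is not a valid Python identifier)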
df1.C1

R1    2.706850
R2   -0.319318
R3    0.528813
R4    0.955057
Name: C1, dtype: float64
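One reason attribute access is discouraged can be sketched as follows. The frame `demo` and its column names are hypothetical, chosen so that one name collides with an existing DataFrame method and another contains a space.

```python
import pandas as pd

# Hypothetical frame: one column name collides with a DataFrame method,
# another contains a space
demo = pd.DataFrame({"count": [3, 5], "unit price": [1.5, 2.0]})

by_bracket = demo["count"]      # always the column
by_attribute = demo.count       # the DataFrame.count method, not the column

print(type(by_bracket))         # a Series
print(callable(by_attribute))   # True: it is a bound method

# demo.unit price would be a syntax error, so brackets are the only option here
prices = demo["unit price"]
```

Bracket access works for every column name, which is why the rest of this section uses it.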

Each DataFrame column is a Series.

In [8]:
type(df1['C2'])

pandas.core.series.Series

**Creating a new column:**

In [56]:
df1['C6'] = df1['C1'] + df1['C2']

In [57]:
df1

Unnamed: 0,C1,C2,C3,C4,C5,C6
R1,-1.467514,-0.494095,-0.162535,0.485809,0.392489,-1.961609
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235,-0.633705
R3,-0.568581,1.407338,0.641806,-0.9051,-0.391157,0.838757
R4,1.028293,-1.972605,-0.866885,0.720788,-1.223082,-0.944312


**Removing Columns**

In [62]:
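df1.drop('C6',axis=1)  # axis=1 drops a column; a new DataFrame is returned

Unnamed: 0,C1,C2,C3,C4,C5
R1,-1.467514,-0.494095,-0.162535,0.485809,0.392489
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235
R3,-0.568581,1.407338,0.641806,-0.9051,-0.391157
R4,1.028293,-1.972605,-0.866885,0.720788,-1.223082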

In [59]:
df1  # drop() does not modify df1 unless inplace=True is passed

Unnamed: 0,C1,C2,C3,C4,C5,C6
R1,-1.467514,-0.494095,-0.162535,0.485809,0.392489,-1.961609
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235,-0.633705
R3,-0.568581,1.407338,0.641806,-0.9051,-0.391157,0.838757
R4,1.028293,-1.972605,-0.866885,0.720788,-1.223082,-0.944312


In [60]:
df1.drop('C6',axis=1,inplace=True)

In [61]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,-1.467514,-0.494095,-0.162535,0.485809,0.392489
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235
R3,-0.568581,1.407338,0.641806,-0.9051,-0.391157
R4,1.028293,-1.972605,-0.866885,0.720788,-1.223082
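The difference between a returned copy and an in-place change can be checked on a small frame. This is a sketch with made-up data (`demo`, `trimmed` and `again` are illustrative names); it uses the `columns=` keyword, which is equivalent to passing `axis=1`.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop returns a new DataFrame; demo itself is unchanged
trimmed = demo.drop(columns="C")   # same as demo.drop("C", axis=1)
print(list(demo.columns))          # ['A', 'B', 'C']
print(list(trimmed.columns))       # ['A', 'B']

# errors="ignore" skips labels that are not present instead of raising KeyError
again = trimmed.drop(columns="C", errors="ignore")
```

Assigning the result to a name, as with `trimmed`, is an alternative to `inplace=True` that keeps the original frame available.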


In [16]:
df1.drop('R4',axis=0)  # axis=0 drops a row by its index label

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R2,-0.319318,-0.848077,0.605965,-2.018168,0.740122
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237


In [17]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R2,-0.319318,-0.848077,0.605965,-2.018168,0.740122
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237
R4,0.955057,0.190794,1.978757,2.605967,0.683509


**Selecting Rows**

In [18]:
df1.loc[['R2','R3']]

Unnamed: 0,C1,C2,C3,C4,C5
R2,-0.319318,-0.848077,0.605965,-2.018168,0.740122
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237


In [19]:
df1.iloc[0]  # select a row by integer position instead of label

C1    2.706850
C2    0.628133
C3    0.907969
C4    0.503826
C5    0.651118
Name: R1, dtype: float64

In [None]:
df1.iloc[:3]

In [20]:
df1.loc['R2']

C1   -0.319318
C2   -0.848077
C3    0.605965
C4   -2.018168
C5    0.740122
Name: R2, dtype: float64

In [21]:
df1.loc['R2','C4']  # select a single value by row label and column label

-2.018168244037392

In [65]:
df1.iloc[:3,1:]

Unnamed: 0,C2,C3,C4,C5
R1,-0.494095,-0.162535,0.485809,0.392489
R2,-0.855196,1.54199,0.666319,-0.538235
R3,1.407338,0.641806,-0.9051,-0.391157


In [22]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R2,-0.319318,-0.848077,0.605965,-2.018168,0.740122
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237
R4,0.955057,0.190794,1.978757,2.605967,0.683509


In [23]:
df1.loc[['R2','R3'],['C1','C5']]

Unnamed: 0,C1,C5
R2,-0.319318,0.740122
R3,0.528813,-0.933237
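A point worth checking when switching between `loc` and `iloc` is how slices behave: label slices include the end label, while position slices exclude the end position. The sketch below uses a small made-up frame named `demo`.

```python
import pandas as pd

demo = pd.DataFrame(
    {"C1": [1, 2, 3, 4], "C2": [5, 6, 7, 8]},
    index=["R1", "R2", "R3", "R4"],
)

by_label = demo.loc["R1":"R3"]   # label slice: includes the end label R3
by_position = demo.iloc[0:3]     # position slice: excludes position 3

print(by_label.index.tolist())     # ['R1', 'R2', 'R3']
print(by_position.index.tolist())  # ['R1', 'R2', 'R3']
print(demo.iloc[1, 0])             # 2  (row position 1, column position 0)
```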


### Filtering Data Based on a Condition (Conditional Selection)

In [24]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R2,-0.319318,-0.848077,0.605965,-2.018168,0.740122
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237
R4,0.955057,0.190794,1.978757,2.605967,0.683509


In [25]:
df1>0.3

Unnamed: 0,C1,C2,C3,C4,C5
R1,True,True,True,True,True
R2,False,False,True,False,True
R3,True,False,False,False,False
R4,True,False,True,True,True


In [26]:
df1[df1>0]

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R2,,,0.605965,,0.740122
R3,0.528813,,0.188695,,
R4,0.955057,0.190794,1.978757,2.605967,0.683509


In [27]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R2,-0.319318,-0.848077,0.605965,-2.018168,0.740122
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237
R4,0.955057,0.190794,1.978757,2.605967,0.683509


In [28]:
df1['C1']>0.3

R1     True
R2    False
R3     True
R4     True
Name: C1, dtype: bool

In [29]:
df1[df1['C1']>0.3]  # returns the rows that satisfy the column condition

Unnamed: 0,C1,C2,C3,C4,C5
R1,2.70685,0.628133,0.907969,0.503826,0.651118
R3,0.528813,-0.589001,0.188695,-0.758872,-0.933237
R4,0.955057,0.190794,1.978757,2.605967,0.683509


In [30]:
df1[df1['C1']>0.3]['C3']

R1    0.907969
R3    0.188695
R4    1.978757
Name: C3, dtype: float64

In [31]:
df1[df1['C1']>0.3][['C3','C5']]

Unnamed: 0,C3,C5
R1,0.907969,0.651118
R3,0.188695,-0.933237
R4,1.978757,0.683509
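The two-step selections above can also be written as a single `.loc` call that takes the boolean mask and the column list together. The sketch below uses a made-up frame `demo`; the single-call form is generally preferred when assigning values, because chained indexing may operate on a temporary copy.

```python
import pandas as pd

demo = pd.DataFrame(
    {"C1": [0.5, -0.2, 0.9], "C3": [1.0, 2.0, 3.0], "C5": [7.0, 8.0, 9.0]},
    index=["R1", "R2", "R3"],
)

mask = demo["C1"] > 0.3

chained = demo[mask][["C3", "C5"]]      # two indexing steps
single = demo.loc[mask, ["C3", "C5"]]   # one step, same result

# For assignment, the single .loc form is the reliable one
demo.loc[mask, "C5"] = 0.0
```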


In [66]:
df1

Unnamed: 0,C1,C2,C3,C4,C5
R1,-1.467514,-0.494095,-0.162535,0.485809,0.392489
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235
R3,-0.568581,1.407338,0.641806,-0.9051,-0.391157
R4,1.028293,-1.972605,-0.866885,0.720788,-1.223082


To combine multiple conditions, use the element-wise operators & (and) and | (or), wrapping each condition in parentheses.

In [69]:
df1[(df1['C1']>0.1) & (df1['C3'] > 0.7)]

Unnamed: 0,C1,C2,C3,C4,C5
R2,0.221491,-0.855196,1.54199,0.666319,-0.538235
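The sketch below, on a small made-up frame `demo`, shows `&`, `|` and the negation operator `~` side by side, and what happens when Python's `and` keyword is used instead.

```python
import pandas as pd

demo = pd.DataFrame({"C1": [0.2, -0.5, 1.0], "C3": [1.5, 0.9, -0.8]})

both = demo[(demo["C1"] > 0.1) & (demo["C3"] > 0.7)]        # AND
either = demo[(demo["C1"] > 0.1) | (demo["C3"] > 0.7)]      # OR
neither = demo[~((demo["C1"] > 0.1) | (demo["C3"] > 0.7))]  # NOT

# Python's `and` asks for a single truth value, which a Series cannot give
try:
    demo[(demo["C1"] > 0.1) and (demo["C3"] > 0.7)]
    raised = False
except ValueError:
    raised = True

print(both.index.tolist())  # [0]
print(raised)               # True
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`.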


### Quiz 2

In [36]:
employees=pd.DataFrame({"Name":['Tom','Nick','John','Peter'], 
                 "Age":[25,26,37,22], "Salary" : [24500, 27000, 42000, 26000]})
employees

Unnamed: 0,Name,Age,Salary
0,Tom,25,24500
1,Nick,26,27000
2,John,37,42000
3,Peter,22,26000


What will be the output here?

**employees[(employees["Age"]>25) | (employees["Salary"]>25000)]**

a.

| Name | Age | Salary | 
|------|------|------| 
| Nick | 26 | 27000 |
| John | 37 | 42000 |
| Peter | 22 | 26000 |

b. 

| Name | Age | Salary | 
|------|------|------| 
| Nick | 26 | 27000 |
| John | 37 | 42000 |

c. 

| Name | Age | Salary | 
|------|------|------| 
| John | 37 | 42000 |
| Peter | 22 | 26000 |

d.

| Name | Age | Salary | 
|------|------|------|
| Tom | 25 | 24500 |
| Nick | 26 | 27000 |
| John | 37 | 42000 |
| Peter | 22 | 26000 |


### More about Indexing 

Let's see how to reset the index or set it to something else.

In [71]:
df2 = pd.DataFrame({"Name":['Tom','Nick','John','Peter',"Ram","Shayam","Mohan","Sundar"], 
                 "Age":[15,16,7,12,11,13,15,18], "Height" : [160,164,154,170,165,172,175,180]})
df2

Unnamed: 0,Name,Age,Height
0,Tom,15,160
1,Nick,16,164
2,John,7,154
3,Peter,12,170
4,Ram,11,165
5,Shayam,13,172
6,Mohan,15,175
7,Sundar,18,180


In [78]:
df3 = df2[(df2["Age"]>11) & (df2["Height"]>150)]
df3

Unnamed: 0,Name,Age,Height
0,Tom,15,160
1,Nick,16,164
3,Peter,12,170
5,Shayam,13,172
6,Mohan,15,175
7,Sundar,18,180


In [79]:
# Reset to default 0,1...n index
df3.reset_index()

Unnamed: 0,index,Name,Age,Height
0,0,Tom,15,160
1,1,Nick,16,164
2,3,Peter,12,170
3,5,Shayam,13,172
4,6,Mohan,15,175
5,7,Sundar,18,180


In [81]:
df3.reset_index(drop=True)

Unnamed: 0,Name,Age,Height
0,Tom,15,160
1,Nick,16,164
2,Peter,12,170
3,Shayam,13,172
4,Mohan,15,175
5,Sundar,18,180


In [84]:
new_index = 'AB CD EF GH IJ KL MN OP'.split()
new_index

['AB', 'CD', 'EF', 'GH', 'IJ', 'KL', 'MN', 'OP']

In [85]:
df2['New'] = new_index

In [86]:
df2

Unnamed: 0,Name,Age,Height,New
0,Tom,15,160,AB
1,Nick,16,164,CD
2,John,7,154,EF
3,Peter,12,170,GH
4,Ram,11,165,IJ
5,Shayam,13,172,KL
6,Mohan,15,175,MN
7,Sundar,18,180,OP


In [87]:
df2.set_index('New')

New,Name,Age,Height
AB,Tom,15,160
CD,Nick,16,164
EF,John,7,154
GH,Peter,12,170
IJ,Ram,11,165
KL,Shayam,13,172
MN,Mohan,15,175
OP,Sundar,18,180


In [88]:
df2

Unnamed: 0,Name,Age,Height,New
0,Tom,15,160,AB
1,Nick,16,164,CD
2,John,7,154,EF
3,Peter,12,170,GH
4,Ram,11,165,IJ
5,Shayam,13,172,KL
6,Mohan,15,175,MN
7,Sundar,18,180,OP


In [89]:
df2.set_index('New',inplace=True)

In [90]:
df2

New,Name,Age,Height
AB,Tom,15,160
CD,Nick,16,164
EF,John,7,154
GH,Peter,12,170
IJ,Ram,11,165
KL,Shayam,13,172
MN,Mohan,15,175
OP,Sundar,18,180
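`set_index` and `reset_index` undo each other, which the following sketch illustrates on a two-row frame. The names `people`, `indexed` and `restored` are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [15, 16]})

indexed = people.set_index("Name")   # Name becomes the row labels
print(indexed.loc["Nick", "Age"])    # 16

restored = indexed.reset_index()     # Name moves back to an ordinary column
print(list(restored.columns))        # ['Name', 'Age']
```

Once a column is the index, its values can be used directly with `loc`, as in `indexed.loc["Nick", "Age"]`.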


### Quiz 3

In [32]:
employees

Unnamed: 0,Name,Age,Salary
0,Tom,25,24500
1,Nick,26,27000
2,John,37,42000
3,Peter,22,26000


Choose the correct statement(s) about the **employees** DataFrame after the operation below.

**employees.set_index('Name',inplace=True).reset_index()**

* No change (same as before)

* It will have two columns **Age** and **Salary**

* The **Name** column is used for the index labels

* It will have four columns **index**, **Name**, **Age** and **Salary**