# Understanding Pandas

## Pandas has two core data structures: the Series and the DataFrame

### A pandas Series is like an ordered dictionary: you can access elements by position as well as by label

### A pandas Series is analogous to a named list in R

In [4]:
import pandas as pd

dataFromList = pd.Series([0.25, 0.5, 0.75, "Make India Great Again!"], index=['a', 'b', 'c', 'd'])

print(dataFromList)

a                       0.25
b                        0.5
c                       0.75
d    Make India Great Again!
dtype: object


In [5]:
dataFromDict = pd.Series({"a" : 0.25, "b" :  0.5, "c" : 0.75, "d" : "Make India Great Again!"})

print(dataFromDict)

a                       0.25
b                        0.5
c                       0.75
d    Make India Great Again!
dtype: object


* __Below we access the same element of the pandas Series twice: once by its index label and once by its position__

In [6]:
print("Using the Index: " + str(dataFromDict['d']))

print("Using the Position: " + str(dataFromDict.iloc[3]))  # .iloc makes positional access explicit

Using the Index: Make India Great Again!
Using the Position: Make India Great Again!
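
The difference between label-based and position-based access matters most when the index itself holds integers. The sketch below uses made-up data (the names `series` and `int_indexed` are illustrative, not from the notebook) and relies on `.loc` for labels and `.iloc` for positions so the intent is never ambiguous.

```python
import pandas as pd

# Illustrative data mirroring the Series above (values are placeholders)
series = pd.Series([0.25, 0.5, 0.75, "text"], index=["a", "b", "c", "d"])

by_label = series.loc["d"]      # label-based lookup
by_position = series.iloc[3]    # position-based lookup (fourth element)

# With an integer index, labels and positions can disagree
int_indexed = pd.Series(["x", "y", "z"], index=[2, 0, 1])
label_zero = int_indexed.loc[0]      # the element labelled 0
position_zero = int_indexed.iloc[0]  # the first element

print(by_label, by_position, label_zero, position_zero)
```

Here `label_zero` and `position_zero` refer to different elements, which is why being explicit with `.loc` and `.iloc` is usually the safer habit.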


## Pandas DataFrames

* __A pandas DataFrame is a collection of labelled Series objects that share a common index__

In [41]:
sampleDataFromDictOfDicts = pd.DataFrame({'Batch': { "Akhil" : 25, "Mayank": 27, "Lakshmi": 30, "Yogesh": 32, "Ashwin": 35},
                            'Scholarship_winners': { "Akhil" : 5, "Mayank": 4, "Lakshmi": 3, "Yogesh": 5, "Ashwin": 5}})

sampleDataFromDictOfDicts

|  | Batch | Scholarship_winners |
|---|---|---|
| Akhil | 25 | 5 |
| Ashwin | 35 | 5 |
| Lakshmi | 30 | 3 |
| Mayank | 27 | 4 |
| Yogesh | 32 | 5 |


In [8]:
sampleDataFromDict = pd.DataFrame({'Batch': [25, 27, 30, 32, 35],
                                   'Scholarship_winners': [5,4,3,5,5]})

sampleDataFromDict

|  | Batch | Scholarship_winners |
|---|---|---|
| 0 | 25 | 5 |
| 1 | 27 | 4 |
| 2 | 30 | 3 |
| 3 | 32 | 5 |
| 4 | 35 | 5 |


In [9]:
sampleDataFromList = pd.DataFrame([[25, 5], [27, 4], [30, 3], [32, 5], [35, 5]],
                                  columns = ["Batch", "Scholarship_winners"],
                                  index=["Akhil", "Mayank", "Lakshmi", "Yogesh", "Ashwin"])

sampleDataFromList

|  | Batch | Scholarship_winners |
|---|---|---|
| Akhil | 25 | 5 |
| Mayank | 27 | 4 |
| Lakshmi | 30 | 3 |
| Yogesh | 32 | 5 |
| Ashwin | 35 | 5 |
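
To make the relationship between a DataFrame and its Series concrete, here is a minimal sketch that builds a DataFrame from two Series and then pulls one column back out. The variable names (`batch`, `winners`, `frame`) are illustrative and the data is a shortened copy of the example above.

```python
import pandas as pd

# Two Series that share the same index labels
batch = pd.Series({"Akhil": 25, "Mayank": 27, "Lakshmi": 30})
winners = pd.Series({"Akhil": 5, "Mayank": 4, "Lakshmi": 3})

# Each dictionary key becomes a column name
frame = pd.DataFrame({"Batch": batch, "Scholarship_winners": winners})

# Selecting a column gives back a Series carrying the DataFrame's index
column = frame["Batch"]
is_series = isinstance(column, pd.Series)
shares_index = column.index.equals(frame.index)

print(frame)
print(is_series, shares_index)
```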


# Going From Pandas to Numpy and Back

* __Some libraries expect their input as a NumPy array, so you need to be comfortable converting pandas DataFrames to NumPy arrays and back__

* __Converting a pandas DataFrame to a NumPy array drops the index and column labels, and sometimes also the extra data types that pandas supports (e.g. datetime)__

In [10]:
sampleNpData = sampleDataFromList.values # Pandas to Numpy Array

sampleNpData

array([[25,  5],
       [27,  4],
       [30,  3],
       [32,  5],
       [35,  5]], dtype=int64)

In [11]:
samplePdData = pd.DataFrame(sampleNpData, columns=sampleDataFromList.columns, 
                            index=["Akhil", "Mayank", "Lakshmi", "Yogesh", "Ashwin"])

samplePdData

|  | Batch | Scholarship_winners |
|---|---|---|
| Akhil | 25 | 5 |
| Mayank | 27 | 4 |
| Lakshmi | 30 | 3 |
| Yogesh | 32 | 5 |
| Ashwin | 35 | 5 |
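
The loss of labels and data types can be seen in a small sketch. The `Joined` column and its dates are invented for illustration, and `to_numpy()` is used here as an alternative to the `.values` attribute shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column and one datetime column
frame = pd.DataFrame(
    {"Batch": [25, 27], "Joined": pd.to_datetime(["2020-01-01", "2021-06-15"])},
    index=["Akhil", "Mayank"],
)

numeric_array = frame[["Batch"]].to_numpy()  # homogeneous integer array, labels gone
mixed_array = frame.to_numpy()               # mixed column types fall back to dtype=object

# Going back: the labels have to be supplied again by hand
restored = pd.DataFrame(numeric_array, columns=["Batch"], index=frame.index)

print(numeric_array.dtype, mixed_array.dtype)
print(restored)
```

Because the array carries no labels, it is worth keeping a reference to the original index and columns if the data needs to return to pandas later.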


# Accessing Elements from a Pandas DataFrame

# iloc vs loc

### __.iloc lets you access the elements using the position of the elements__

![](img/pd_iloc.jpg)

In [45]:
allColumns = samplePdData.iloc[3, : ]

print("All Columns: " + "\n---------------------------------\n" + str(allColumns))

oneColumn = samplePdData.iloc[3, :1 ]

print("\n" + "Only the First Column: " + "\n---------------------------------\n" + str(oneColumn))

twoColumns = samplePdData.iloc[3, :2 ]

print("\n" + "Till The Second Column: " + "\n---------------------------------\n" + str(twoColumns))

All Columns: 
---------------------------------
Batch                  32
Scholarship_winners     5
Name: Yogesh, dtype: int64

Only the First Column: 
---------------------------------
Batch    32
Name: Yogesh, dtype: int64

Till The Second Column: 
---------------------------------
Batch                  32
Scholarship_winners     5
Name: Yogesh, dtype: int64
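
What `.iloc` returns depends on the shape of the selection. The following sketch rebuilds the same data under the illustrative name `frame` and shows a scalar, a Series and a DataFrame coming back from different positional selections.

```python
import pandas as pd

frame = pd.DataFrame(
    [[25, 5], [27, 4], [30, 3], [32, 5], [35, 5]],
    columns=["Batch", "Scholarship_winners"],
    index=["Akhil", "Mayank", "Lakshmi", "Yogesh", "Ashwin"],
)

single_value = frame.iloc[3, 0]  # scalar: row position 3, column position 0
one_row = frame.iloc[3, :]       # Series: one row, all columns
row_block = frame.iloc[1:3, :]   # DataFrame: row positions 1 and 2 (end excluded)
last_row = frame.iloc[-1, :]     # negative positions count from the end

print(single_value)
print(row_block)
```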


### __.loc lets you access the elements using the column and the index labels of the elements__

![](img/pd_loc.jpg)

In [13]:
samplePdData.loc["Lakshmi", :"Scholarship_winners"]

Batch                  30
Scholarship_winners     3
Name: Lakshmi, dtype: int64
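
One difference that is easy to miss: label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. A minimal sketch, with the data rebuilt under the illustrative name `frame`:

```python
import pandas as pd

frame = pd.DataFrame(
    [[25, 5], [27, 4], [30, 3], [32, 5], [35, 5]],
    columns=["Batch", "Scholarship_winners"],
    index=["Akhil", "Mayank", "Lakshmi", "Yogesh", "Ashwin"],
)

# Label slice: both "Mayank" and "Yogesh" are included
by_label = frame.loc["Mayank":"Yogesh", "Batch"]

# Position slice: positions 1 and 2 only, position 3 is excluded
by_position = frame.iloc[1:3, 0]

print(by_label)
print(by_position)
```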

## Indexing using [ ] with a single column label returns a Pandas Series Object

In [14]:
samplePdData['Batch']

Akhil      25
Mayank     27
Lakshmi    30
Yogesh     32
Ashwin     35
Name: Batch, dtype: int64
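
The type that comes back from `[ ]` depends on what goes inside the brackets. As a rough sketch (again using an illustrative `frame` with a subset of the data), a single label gives a Series, while a list of labels gives a DataFrame even when the list has only one entry.

```python
import pandas as pd

frame = pd.DataFrame(
    [[25, 5], [27, 4], [30, 3]],
    columns=["Batch", "Scholarship_winners"],
    index=["Akhil", "Mayank", "Lakshmi"],
)

one_column = frame["Batch"]    # single label -> Series
as_frame = frame[["Batch"]]    # list of labels -> DataFrame with one column

print(type(one_column).__name__, type(as_frame).__name__)
```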

* __You might come across this method of indexing when you want to conditionally access elements from a pandas DataFrame__

In [15]:
samplePdData.loc[samplePdData['Batch'] > 28, : ] # Here we access only those rows where the Batch number is higher than 28

|  | Batch | Scholarship_winners |
|---|---|---|
| Lakshmi | 30 | 3 |
| Yogesh | 32 | 5 |
| Ashwin | 35 | 5 |
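
The comparison inside `.loc` produces a boolean Series, and several such conditions can be combined with `&` (and) and `|` (or). A short sketch on the same data, rebuilt under the illustrative name `frame`; note that each condition needs its own parentheses.

```python
import pandas as pd

frame = pd.DataFrame(
    [[25, 5], [27, 4], [30, 3], [32, 5], [35, 5]],
    columns=["Batch", "Scholarship_winners"],
    index=["Akhil", "Mayank", "Lakshmi", "Yogesh", "Ashwin"],
)

# The condition alone is a boolean Series aligned with the index
mask = frame["Batch"] > 28

# Rows that satisfy both conditions
both = frame.loc[(frame["Batch"] > 28) & (frame["Scholarship_winners"] == 5), :]

# Rows that satisfy at least one condition
either = frame.loc[(frame["Batch"] < 26) | (frame["Scholarship_winners"] == 3), :]

print(mask)
print(both)
print(either)
```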


## GroupBy in Pandas

### Finding the mean age at each Insofe Branch

In [16]:
cityData = pd.DataFrame( {"Name" : ["Akhil", "Mayank", "Yogesh", "Lakshmi", "Jeevan" , "Sridhar"] , 
                    "City" : ["Bangalore", "Hyderabad", "Bangalore", "Hyderabad", "Bangalore", "Hyderabad"],
                    "Age" : [25, 25, 23, 26, 27, 40]})
cityData

|  | Age | City | Name |
|---|---|---|---|
| 0 | 25 | Bangalore | Akhil |
| 1 | 25 | Hyderabad | Mayank |
| 2 | 23 | Bangalore | Yogesh |
| 3 | 26 | Hyderabad | Lakshmi |
| 4 | 27 | Bangalore | Jeevan |
| 5 | 40 | Hyderabad | Sridhar |


* __Applying the GroupBy method returns a groupby object__

In [17]:
cityData.groupby(by = "City")

<pandas.core.groupby.DataFrameGroupBy object at 0x000002646A7E2390>

* __We can view its structure by converting it into a list of (group name, sub-DataFrame) tuples__

In [18]:
list(cityData.groupby(by = "City"))

[('Bangalore',    Age       City    Name
  0   25  Bangalore   Akhil
  2   23  Bangalore  Yogesh
  4   27  Bangalore  Jeevan), ('Hyderabad',    Age       City     Name
  1   25  Hyderabad   Mayank
  3   26  Hyderabad  Lakshmi
  5   40  Hyderabad  Sridhar)]
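
Since the groupby object yields (group name, sub-DataFrame) pairs, it can also be looped over directly, and a single group can be fetched with `get_group`. A minimal sketch, with the data rebuilt under the illustrative name `city_data`:

```python
import pandas as pd

city_data = pd.DataFrame({
    "Name": ["Akhil", "Mayank", "Yogesh", "Lakshmi", "Jeevan", "Sridhar"],
    "City": ["Bangalore", "Hyderabad", "Bangalore", "Hyderabad", "Bangalore", "Hyderabad"],
    "Age": [25, 25, 23, 26, 27, 40],
})

grouped = city_data.groupby(by="City")

# Iterating yields one (key, sub-DataFrame) pair per group
group_sizes = {city: len(group) for city, group in grouped}

# Fetch the rows of one group by its key
bangalore = grouped.get_group("Bangalore")

print(group_sizes)
print(bangalore)
```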

* __You can perform operations on a groupby object to return a DataFrame__

In [19]:
cityData.groupby(by = "City").mean(numeric_only=True)  # numeric_only=True skips non-numeric columns such as Name

| City | Age |
|---|---|
| Bangalore | 25.0 |
| Hyderabad | 30.333333 |


In [20]:
cityData.groupby(by = "City").count()

| City | Age | Name |
|---|---|---|
| Bangalore | 3 | 3 |
| Hyderabad | 3 | 3 |
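
Several summaries can be computed in one pass by selecting a column from the groupby object and calling `agg` with a list of function names. This is a sketch on the same data, rebuilt under the illustrative name `city_data`:

```python
import pandas as pd

city_data = pd.DataFrame({
    "Name": ["Akhil", "Mayank", "Yogesh", "Lakshmi", "Jeevan", "Sridhar"],
    "City": ["Bangalore", "Hyderabad", "Bangalore", "Hyderabad", "Bangalore", "Hyderabad"],
    "Age": [25, 25, 23, 26, 27, 40],
})

# Select the Age column first, then compute three statistics per city
age_stats = city_data.groupby(by="City")["Age"].agg(["mean", "min", "max"])

print(age_stats)
```

Selecting the column before aggregating keeps non-numeric columns such as `Name` out of the calculation.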


* __We can also apply any function to a groupby object using the apply method__

![](img/pd_apply.png)

In [21]:
cityData.groupby(by = "City").apply(lambda group: group.sum())  # sums every column within each group

| City | Age | City | Name |
|---|---|---|---|
| Bangalore | 75 | BangaloreBangaloreBangalore | AkhilYogeshJeevan |
| Hyderabad | 91 | HyderabadHyderabadHyderabad | MayankLakshmiSridhar |
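
`apply` is most useful with a function that has no built-in groupby equivalent. As a hedged sketch, the hypothetical helper `age_range` below computes the spread between the oldest and youngest person in each city; the data is rebuilt under the illustrative name `city_data`.

```python
import pandas as pd

city_data = pd.DataFrame({
    "Name": ["Akhil", "Mayank", "Yogesh", "Lakshmi", "Jeevan", "Sridhar"],
    "City": ["Bangalore", "Hyderabad", "Bangalore", "Hyderabad", "Bangalore", "Hyderabad"],
    "Age": [25, 25, 23, 26, 27, 40],
})


def age_range(ages):
    """Hypothetical helper: difference between the largest and smallest age."""
    return ages.max() - ages.min()


# The function is called once per group, receiving that group's Age values
age_range_by_city = city_data.groupby(by="City")["Age"].apply(age_range)

print(age_range_by_city)
```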


<img src='img/pandas_cheat_sheet.jpg' />