# Pandas
Pandas is a Python library used for analysing and manipulating data.

In [2]:
import pandas as pd

## read_csv() function
This function takes the path to a CSV file as input, reads it, and returns a dataframe.
## DataFrame
A dataframe is a data structure that stores data in rows and columns, much like a spreadsheet in Excel. It can be constructed from lists and dictionaries.

In [3]:
data_frame = pd.read_csv("data.csv")
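
To try read_csv() without a file on disk, here is a minimal sketch that reads from an in-memory string instead. The CSV text and the name `sample_frame` are made up for illustration; only the column names are borrowed from the dataset used in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "male,age,BMI\n1,39,26.97\n0,46,28.73\n"

# read_csv accepts a file path or any file-like object.
sample_frame = pd.read_csv(io.StringIO(csv_text))

print(sample_frame.shape)          # (rows, columns)
print(list(sample_frame.columns))  # header names taken from the first line
```

The first line of the CSV becomes the column headers, and every following line becomes a row.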

## Creating our own dataframe using pd.DataFrame(data)
We can create our own dataframe manually using dictionaries and lists. pd.DataFrame() takes a dictionary and forms a dataframe from it. The keys of the dictionary become the column headers, and the list stored under each key holds all the values of that column.

In [11]:
data = {'song':['a','b','c'], 'release_year':['2003','2004','2005'], 'singer':['d','e','f']}
my_data_frame = pd.DataFrame(data)
my_data_frame.head()

Unnamed: 0,song,release_year,singer
0,a,2003,d
1,b,2004,e
2,c,2005,f
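
As a small follow-up sketch, we can inspect the dataframe built from the dictionary. Note that the years above were given as strings, so the example converts them to integers before doing arithmetic on them. The variable name `songs` is chosen here for illustration.

```python
import pandas as pd

data = {'song': ['a', 'b', 'c'],
        'release_year': ['2003', '2004', '2005'],
        'singer': ['d', 'e', 'f']}
songs = pd.DataFrame(data)

print(songs.shape)          # number of rows and columns
print(list(songs.columns))  # the dictionary keys, in order

# The years are strings; convert them to integers so they can be compared numerically.
songs['release_year'] = songs['release_year'].astype(int)
print(songs['release_year'].max())
```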


## The head() function
This function returns the first 5 rows of the data by default, as shown below. Pass a number, for example head(10), to get a different number of rows.

In [4]:
data_frame.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0
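
A short sketch of head() with and without an argument, using a small made-up dataframe called `numbers`. tail() is the counterpart that returns rows from the end.

```python
import pandas as pd

# Hypothetical ten-row dataframe for illustration.
numbers = pd.DataFrame({'n': range(10)})

first_five = numbers.head()    # default: first 5 rows
first_two = numbers.head(2)    # first 2 rows
last_three = numbers.tail(3)   # last 3 rows

print(len(first_five), len(first_two), len(last_three))
```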


## Getting a column from a dataframe
To get columns from a dataframe, we can use the following syntax. The outer square brackets index into the dataframe. The inner list names the columns that we want to extract from our current dataframe.

In [6]:
data_frame2 = data_frame[['male']]
# We can also select two or more columns by listing their names in the inner list.
data_frame3 = data_frame[['male', 'currentSmoker']]

## Getting the unique values from a column using unique() function
The unique function returns the unique values of a single column (a pandas Series); a dataframe itself has no unique() method. For example, in male there are two unique values: 1 & 0

In [42]:
print(data_frame2['male'].unique())
print(data_frame3['currentSmoker'].unique())

[1 0]
[0 1]
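
Two related methods are often useful next to unique(): nunique() counts the distinct values and value_counts() counts how often each one occurs. A minimal sketch on a made-up Series named `flags`:

```python
import pandas as pd

# Hypothetical column of 0/1 flags.
flags = pd.Series([1, 0, 1, 1, 0], name='male')

print(flags.unique())        # distinct values, in order of first appearance
print(flags.nunique())       # how many distinct values there are
print(flags.value_counts())  # how many times each value occurs
```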


## Understanding the difference between a pandas Series and a pandas DataFrame, and between [] and [[]]
Here's a link that explains [] vs. [[]]: <br>
https://stackoverflow.com/questions/33813223/whats-the-difference-between-and-in-pandas

A <b>pandas Series</b> is a one-dimensional data structure made up of key-value pairs, where the keys (labels) form the index and the values are the data stored at each index.

A <b>pandas DataFrame</b> is a two-dimensional data structure that can be thought of as a spreadsheet. It can also be thought of as a collection of one or more Series that share a common index.

Here's a link for more info: <br>
https://www.educative.io/answers/series-vs-dataframe-in-pandas

In [15]:
type(data_frame['male'])

pandas.core.series.Series

In [16]:
type(data_frame[['male']])

pandas.core.frame.DataFrame
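
The difference also shows up in the shape of the result. In this sketch (the small dataframe `frame` is made up for illustration), single brackets give a one-dimensional Series and double brackets give a two-dimensional DataFrame with one column.

```python
import pandas as pd

frame = pd.DataFrame({'male': [1, 0, 1], 'age': [39, 46, 48]})

as_series = frame['male']    # single brackets -> Series
as_frame = frame[['male']]   # double brackets -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```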

## Getting rows and columns from a dataframe using data_frame.iloc[] and data_frame.loc[]

<u><b> iloc: </b> </u>
<ul>
<li>iloc stands for "integer location".</li>
<li>It is primarily used for indexing by integer position.</li>
<li>You pass integer-based indexers to extract data.</li>
<li>The syntax for iloc is dataframe.iloc[row_index, column_index].</li>
<li>It does not include the ending index, similar to Python's slicing.</li>
<li>For example, df.iloc[0:3, 1:4] selects rows 0 through 2 (exclusive of 3) and columns 1 through 3 (exclusive of 4).
iloc works with integer positions regardless of the labels in the index or columns.</li>
</ul>

<u> <b> loc: </b></u>

<ul>
<li>loc stands for "location".</li>
<li>It is primarily used for indexing by label.</li>
<li>You pass label-based indexers to extract data.</li>
<li>The syntax for loc is dataframe.loc[row_label, column_label].</li>
<li>It includes the ending label, unlike Python's slicing.</li>
<li>For example, df.loc['A':'C', 'x':'z'] selects rows labeled 'A' through 'C' (inclusive) and columns labeled 'x' through 'z' (inclusive).
loc works with the labels present in the index or columns.</li>
</ul>


In [22]:
data_frame.iloc[2:5,5:8]

Unnamed: 0,BPMeds,prevalentStroke,prevalentHyp
2,0.0,0,0
3,0.0,0,1
4,0.0,0,0


In [23]:
data_frame.loc[2:5,'BPMeds':'prevalentHyp']

Unnamed: 0,BPMeds,prevalentStroke,prevalentHyp
2,0.0,0,0
3,0.0,0,1
4,0.0,0,0
5,0.0,0,1
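
With the default integer index, loc and iloc can look deceptively similar. The sketch below uses a made-up dataframe `scores` with string row labels, where the difference between position-based and label-based slicing is easier to see.

```python
import pandas as pd

# Hypothetical dataframe with string labels on the rows.
scores = pd.DataFrame(
    {'x': [1, 2, 3, 4], 'y': [5, 6, 7, 8], 'z': [9, 10, 11, 12]},
    index=['A', 'B', 'C', 'D'],
)

# iloc: integer positions, end excluded -> rows A, B and columns x, y
by_position = scores.iloc[0:2, 0:2]

# loc: labels, end included -> rows A, B, C and columns x, y
by_label = scores.loc['A':'C', 'x':'y']

print(by_position)
print(by_label)
```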


## Selecting based on some condition
We can use loc to select the rows that satisfy a condition. Syntax: x = data_frame.loc[data_frame.male == 1]

<b>How does it work?</b> <br>
The expression data_frame.male == 1 returns a pandas Series of boolean values (True and False), one per row. When loc receives this Series, it treats it as a mask rather than as labels: rows where the mask is True are kept and rows where it is False are skipped.<br>
<u><strong>Important</strong>: loc accepts a boolean Series directly, but iloc does not. iloc needs a plain boolean list or array, for example the Series converted with to_numpy().</u>

In [29]:
x = data_frame.loc[data_frame.male == 1]
display(x)

#for multiple conditions
y = data_frame.loc[(data_frame.male == 1) & (data_frame.BMI < 20)]
display(y)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
8,1,52,1.0,0,0.0,0.0,0,1,0,260.0,141.5,89.0,26.36,76.0,79.0,0
9,1,43,1.0,1,30.0,0.0,0,1,0,225.0,162.0,107.0,23.61,93.0,88.0,0
12,1,46,1.0,1,15.0,0.0,0,1,0,294.0,142.0,94.0,26.31,98.0,64.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4227,1,43,4.0,1,20.0,0.0,0,0,0,187.0,129.5,88.0,25.62,80.0,75.0,0
4231,1,58,3.0,0,0.0,0.0,0,1,0,187.0,141.0,81.0,24.96,80.0,81.0,0
4232,1,68,1.0,0,0.0,0.0,0,1,0,176.0,168.0,97.0,23.14,60.0,79.0,1
4233,1,50,1.0,1,1.0,0.0,0,1,0,313.0,179.0,92.0,25.97,66.0,86.0,1


Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
33,1,61,,1,5.0,0.0,0,0,0,175.0,134.0,82.5,18.59,72.0,75.0,1
62,1,53,1.0,1,20.0,0.0,0,0,0,220.0,123.5,75.0,19.64,78.0,73.0,0
90,1,53,1.0,1,20.0,0.0,0,0,0,188.0,138.0,89.0,18.23,60.0,75.0,0
182,1,36,1.0,1,40.0,0.0,0,0,0,215.0,118.0,76.0,18.99,96.0,97.0,0
254,1,51,4.0,1,15.0,0.0,0,0,0,238.0,125.0,80.0,19.36,60.0,66.0,0
312,1,39,1.0,1,20.0,0.0,0,0,0,258.0,110.0,65.0,19.97,65.0,85.0,0
320,1,44,1.0,1,20.0,0.0,0,0,0,197.0,118.0,81.0,17.44,70.0,75.0,0
351,1,49,2.0,1,5.0,0.0,0,0,0,187.0,110.0,67.0,19.26,78.0,85.0,0
377,1,67,1.0,0,0.0,0.0,0,0,0,203.0,122.0,74.0,15.54,96.0,79.0,0
384,1,40,1.0,1,35.0,0.0,0,0,0,195.0,122.5,66.5,19.98,60.0,72.0,0


In [40]:
# data_frame.loc[True, True, False, True] raises an error: the comma-separated booleans are read as a tuple of four indexers, not as a mask.
# Pass a single boolean list, array, or Series with one value per row instead.

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10],
    'C': [11, 12, 13, 14, 15]
}
df = pd.DataFrame(data)

# Create a boolean series manually
boolean_series = pd.Series([True, False, True, False, True])

# Use the boolean series as an indexer with loc
result = df.loc[boolean_series]
display(result)

Unnamed: 0,A,B,C
0,1,6,11
2,3,8,13
4,5,10,15
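
For comparison, here is a sketch of the other mask forms: loc also accepts a plain Python list of booleans, and iloc accepts the mask once it has been converted to a NumPy array. The variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, 8, 9, 10]})
mask = pd.Series([True, False, True, False, True])

# loc with a plain list of booleans (one value per row)
via_loc_list = df.loc[[True, False, True, False, True]]

# iloc with the Series converted to a NumPy boolean array
via_iloc = df.iloc[mask.to_numpy()]

print(via_loc_list)
print(via_iloc)
```

Both selections return the rows at positions 0, 2 and 4, the same rows as the loc example above.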


## Important Links

https://www.coursera.org/learn/python-for-applied-data-science-ai/ungradedWidget/ggzY3/reading-pandas

https://www.geeksforgeeks.org/difference-between-loc-and-iloc-in-pandas-dataframe/