# <center>Python Toolset for Data Science</center>

## Section I - Datatypes

<u> Properties: </u>

Arrays are homogeneous, mutable datatypes that allow duplicate values. <br>
Lists are similar to arrays except that they support heterogeneous datatypes.<br>
Tuples have the same properties as lists, but they are immutable.<br>
Dictionaries are mutable collections of key-value pairs; keys must be unique, although values may repeat.

<u> When to use </u>

Arrays - When you need random access and all elements share one datatype. <br>
Lists - When you need an ordered, iterable collection that is often modified.<br>
Tuples - When the data must not change. <br>
Dictionaries - When there is a logical association (key-value) and you need fast lookup by key, including for data that is frequently modified.<br>
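
The properties listed above can be checked directly. The following is a minimal sketch that uses the standard-library `array` module as a stand-in for an array type (an assumption; the notebook may have NumPy arrays in mind). All variable names are illustrative.

```python
from array import array

# Array: homogeneous, mutable, allows duplicates
nums = array("i", [1, 2, 2])
nums[0] = 10                      # in-place modification works
try:
    nums.append(1.5)              # a float does not fit an integer array
    array_accepts_float = True
except TypeError:
    array_accepts_float = False

# List: heterogeneous, mutable, allows duplicates
mixed = [1, "two", 3.0, 1]
mixed.append(None)

# Tuple: heterogeneous, but immutable
point = (1, "two", 3.0)
try:
    point[0] = 99
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# Dictionary: mutable, keys are unique, values may repeat
codes = {"BR": "Brazil", "RU": "Russia"}
codes["BR"] = "Brasil"            # re-using a key overwrites its value
codes["XX"] = "Russia"            # duplicate values are allowed

print(array_accepts_float, tuple_is_mutable, len(codes))
```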

![title](img/Python Datatypess.png)

Importing required modules

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Named brics_data rather than dict, which would shadow the built-in type
brics_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "code" : ["BR", "RU", "IN", "CH", "SA"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

Convert a dictionary to a DataFrame

In [3]:
brics_dict = pd.DataFrame(brics_data)

In [4]:
brics_dict

,area,capital,code,country,population
0,8.516,Brasilia,BR,Brazil,200.4
1,17.1,Moscow,RU,Russia,143.5
2,3.286,New Delhi,IN,India,1252.0
3,9.597,Beijing,CH,China,1357.0
4,1.221,Pretoria,SA,South Africa,52.98


## Section II - Index 

An index labels the rows of a dataframe so that they can be looked up efficiently by label.

In [5]:
brics_dict.set_index('code')

code,area,capital,country,population
BR,8.516,Brasilia,Brazil,200.4
RU,17.1,Moscow,Russia,143.5
IN,3.286,New Delhi,India,1252.0
CH,9.597,Beijing,China,1357.0
SA,1.221,Pretoria,South Africa,52.98


In [8]:
brics_dict.index.values

array([0, 1, 2, 3, 4], dtype=int64)

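The index is not a column, so you cannot delete it the way you would drop a column. You can only reset it with reset_index(), which restores the default integer index and moves the old index into a column.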

In [6]:
brics_dict.reset_index()

,index,area,capital,code,country,population
0,0,8.516,Brasilia,BR,Brazil,200.4
1,1,17.1,Moscow,RU,Russia,143.5
2,2,3.286,New Delhi,IN,India,1252.0
3,3,9.597,Beijing,CH,China,1357.0
4,4,1.221,Pretoria,SA,South Africa,52.98
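
Note that `set_index` and `reset_index` return new dataframes rather than changing the original, which is why the index values printed above are still 0 to 4. The sketch below illustrates this on a small, made-up subset of the data; the names `df`, `indexed`, `restored` and `dropped` are illustrative.

```python
import pandas as pd

# Small illustrative frame (a subset of the BRICS data)
df = pd.DataFrame({
    "country": ["Brazil", "Russia", "India"],
    "code": ["BR", "RU", "IN"],
})

indexed = df.set_index("code")            # returns a new DataFrame
print(list(df.index))                     # the original keeps its integer index
print(list(indexed.index))                # the new frame is indexed by code

restored = indexed.reset_index()          # 'code' becomes a regular column again
dropped = indexed.reset_index(drop=True)  # discard the old index instead of keeping it
print(list(restored.columns))
print(list(dropped.columns))
```

To keep the result, assign it back to a variable, as the next cell does with `brics = brics.set_index('code')`.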


In [9]:
brics = pd.read_csv('data/brics.csv')
brics = brics.set_index('code')
brics

code,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


## Section III - Selecting Data

(a) Square Brackets <br>
(b) Advanced Methods (loc, iloc)

### <u> Square Brackets </u>

#### Column Access - Square Brackets

Passing a column name inside square brackets returns that column of data

In [14]:
brics["country"]

code
BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object

This column is of a 'Series' datatype

In [15]:
type(brics["country"])

pandas.core.series.Series

To get a dataframe instead, pass a list of column names, which is why the column name is enclosed in double square brackets.

In [16]:
brics[["country"]]

code,country
BR,Brazil
RU,Russia
IN,India
CH,China
SA,South Africa


Now you see that the type is a dataframe instead of a Series

In [17]:
type(brics[["country"]])

pandas.core.frame.DataFrame

#### Row Access - Square Brackets

You can access rows of data by slicing on row positions. The end position is excluded, so 1:4 returns the rows at positions 1, 2 and 3.

In [47]:
brics[1:4]

code,country,capital,area,population
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0


#### Row and Column Access - Square Brackets

You can also select rows and a column at once by chaining the two forms of access

In [38]:
brics[1:4]['country']

code
RU    Russia
IN     India
CH     China
Name: country, dtype: object

### <u> Advanced Methods </u>

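We can also select data using the loc and iloc indexers of pandas. loc selects by label and iloc selects by integer position; both are used with square brackets rather than called like functions.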

#### Row Access - loc()

We pass the index values to loc to select data from specific rows

In [31]:
brics.loc[["RU", "IN", "CH"]]

code,country,capital,area,population
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0


Please note the double square brackets the row labels are enclosed in: passing a list of labels returns a dataframe.

#### Column Access - loc()

To access columns using loc, pass : as the row selector to keep every row, followed by the column names.

In [46]:
brics.loc[:, ["country"]]

code,country
BR,Brazil
RU,Russia
IN,India
CH,China
SA,South Africa


#### Row and Column Access - loc()

You can also access data by passing both the rows and columns of interest  

In [35]:
brics.loc[["RU", "IN", "CH"], ["country"]]

code,country
RU,Russia
IN,India
CH,China


#### loc() vs square brackets []

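Both appear to do the same job, so why are there two different ways to select data? Compared with plain square brackets, loc:


(i)   allows slicing of columns by label: df.loc[ : , 'col1':'coln']

(ii)  allows selecting a single row by its label: df.loc[5]

(iii) assigns directly into the original dataframe, whereas chained square brackets may assign to a temporary copy

      df[1:3]['col1'] = 5      # chained indexing: may raise a SettingWithCopyWarning and leave df unchanged
      df.loc[1:3, 'col1'] = 5  # changes the values in df (label slice, so the end label 3 is included)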
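
The differences listed above can be sketched on a small frame. This is an illustrative example with made-up variable names, not the notebook's own data file; it shows that label slices with `loc` include the end label, positional slices with `[]` exclude the end position, and assignment through `loc` updates the dataframe itself.

```python
import pandas as pd

df = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China"],
     "area": [8.516, 17.10, 3.286, 9.597]},
    index=["BR", "RU", "IN", "CH"],
)

# Label slices with loc include the end label (rows RU and IN, both columns)
by_label = df.loc["RU":"IN", "country":"area"]

# Positional slices with [] exclude the end position (positions 1 and 2)
by_position = df[1:3]

# Assigning through loc updates df itself
df.loc["RU":"IN", "area"] = 0.0

print(by_label.shape, by_position.shape)
print(df["area"].tolist())
```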

#### Row Access - iloc()

In [54]:
brics.iloc[[1]]

code,country,capital,area,population
RU,Russia,Moscow,17.1,143.5


#### Column Access - iloc()

In [55]:
brics.iloc[:,[1]]

code,capital
BR,Brasilia
RU,Moscow
IN,New Delhi
CH,Beijing
SA,Pretoria


#### Row and Column Access - iloc()

In [62]:
brics.iloc[[1,3],[1,3]]

code,capital,population
RU,Moscow,143.5
CH,Beijing,1357.0


#### Slicing using iloc

In [74]:
brics.iloc[:2,:2]

code,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow


#### Selecting data using comparison operators

In [75]:
brics['population']>50

code
BR    True
RU    True
IN    True
CH    True
SA    True
Name: population, dtype: bool

In [78]:
brics[brics['population']>100]

code,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0


In [86]:
brics[(brics['population']>100) & (brics['area']<10)]

code,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
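
Boolean masks can be combined in other ways as well. The sketch below, on an illustrative copy of the area and population columns, uses `|` for "or", `~` for "not", and `Series.between` for a range check; each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators.

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [8.516, 17.10, 3.286, 9.597, 1.221],
     "population": [200.4, 143.5, 1252, 1357, 52.98]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Rows where either condition holds
large_or_populous = df[(df["area"] > 10) | (df["population"] > 1000)]

# Rows where the condition does NOT hold
not_populous = df[~(df["population"] > 1000)]

# Range check; between() is inclusive on both ends by default
in_range = df[df["area"].between(3, 10)]

print(list(large_or_populous.index))
print(list(not_populous.index))
print(list(in_range.index))
```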


## Section IV - Functions

In [99]:
def sum1(num1, num2):
    """Adds two given numbers"""
    total = num1 + num2  # named total to avoid shadowing the built-in sum()
    return total

sum1(2,4)

6

In [100]:
def sum2(*args):
    """Adds any number of given numbers"""
    total = 0  # named total to avoid shadowing the built-in sum()
    for num in args:
        total += num
    return total

sum2(2,4,6,8)

20
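
As a complement to `*args`, a function can also take keyword-only parameters and collect arbitrary keyword arguments with `**kwargs`. The following is a minimal sketch; the function names `total` and `describe` are hypothetical and not part of the notebook.

```python
def total(*args, start=0):
    """Add any number of values to an optional starting value."""
    # The built-in sum() accepts an iterable and a start value
    return sum(args, start)


def describe(**kwargs):
    """Return the keyword arguments as sorted 'name=value' strings."""
    return [f"{name}={value}" for name, value in sorted(kwargs.items())]


print(total(2, 4, 6, 8))
print(total(2, 4, start=10))
print(describe(code="BR", area=8.516))
```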

## Section V - Lambda Functions

A compact way of writing small, anonymous functions as a single expression

General Function: <i>def add(a, b): return a + b </i>

Lambda Function: <i>lambda a,b : a+b </i>

In [101]:
sum3 = lambda num1, num2 : num1 + num2

In [102]:
sum3(10, 20)

30

<b> Why lambda function? </b>

Python supports a style of programming called functional programming, in which functions can be passed as arguments to other functions. A lambda lets you define such a function inline, at the point where it is passed.
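
For instance, the built-in `sorted`, `map` and `filter` all accept a function as an argument. The sketch below passes lambdas to each of them; the list and variable names are illustrative.

```python
countries = ["Brazil", "Russia", "India", "China", "South Africa"]

# key= takes a function that computes the sort key for each item
by_length = sorted(countries, key=lambda name: len(name))

# map applies the function to every item
upper = list(map(lambda name: name.upper(), countries))

# filter keeps the items for which the function returns True
short = list(filter(lambda name: len(name) <= 5, countries))

print(by_length)
print(upper)
print(short)
```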

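#### apply()

apply() calls a function on every element of a column. Here the built-in len is passed to compute the length of each country name.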

In [None]:
brics['length'] = brics['country'].apply(len)

In [None]:
brics

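#### apply and lambda Functions

apply() also accepts a lambda, which avoids defining a named function for a one-off transformation. Here each population value is multiplied by 100; the column name pop_million is only a label for the scaled values.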

In [111]:
brics['pop_million'] = brics['population'].apply(lambda x: x * 100)

In [112]:
brics

code,country,capital,area,population,length,pop_million
BR,Brazil,Brasilia,8.516,200.4,6,20040.0
RU,Russia,Moscow,17.1,143.5,6,14350.0
IN,India,New Delhi,3.286,1252.0,5,125200.0
CH,China,Beijing,9.597,1357.0,5,135700.0
SA,South Africa,Pretoria,1.221,52.98,12,5298.0
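
For simple arithmetic like this, the same result can usually be obtained without `apply` by operating on the whole column at once. The sketch below compares the two on an illustrative copy of the population column; the variable names are hypothetical.

```python
import pandas as pd

population = pd.Series([200.4, 143.5, 1252, 1357, 52.98],
                       index=["BR", "RU", "IN", "CH", "SA"])

via_apply = population.apply(lambda x: x * 100)   # calls the lambda once per element
vectorized = population * 100                     # one operation on the whole Series

# Compare with a small tolerance rather than exact float equality
same = bool(((via_apply - vectorized).abs() < 1e-9).all())
print(same)
```

`apply` remains useful when the transformation cannot be written as a column-wise expression, as with `len` above.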


In [109]:
# brics = brics.drop(['pop_billion'], axis=1)

## Section VI - Comprehensions

(a) List Comprehensions <br>
(b) Dict Comprehensions <br>
(c) Generator <br>

#### <u> List Comprehensions </u>

A concise way of creating lists from any iterable.

In [1]:
res = [num for num in range(11)]

In [2]:
res

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Create new list from existing

In [3]:
new_res = [num + 1 for num in res]

In [4]:
new_res

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

Comprehensions with a condition

In [6]:
newer_res = [num for num in res if num % 2 == 0]

In [7]:
newer_res

[0, 2, 4, 6, 8, 10]

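#### <u> Dict Comprehensions </u>

Use {} instead of []. A dict comprehension needs a key: value pair before the for; with a single expression, as in the cells below, the braces build a set instead.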

In [10]:
res = {num for num in range(11)}

In [11]:
res

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

In [12]:
new_res = {num + 1 for num in res}

In [13]:
new_res

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}

In [14]:
newer_res = {num for num in res if num % 2 == 0}

In [15]:
newer_res

{0, 2, 4, 6, 8, 10}
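
The cells above produce sets, since each comprehension yields a single value per item. A dict comprehension proper yields a `key: value` pair, as in the following minimal sketch (variable names are illustrative):

```python
# Map each number to its square
squares = {num: num ** 2 for num in range(5)}

# Filter an existing dict with a condition
even_squares = {num: sq for num, sq in squares.items() if num % 2 == 0}

# Build a lookup table from two parallel lists
codes = ["BR", "RU", "IN"]
countries = ["Brazil", "Russia", "India"]
code_to_country = {code: country for code, country in zip(codes, countries)}

print(squares)
print(even_squares)
print(code_to_country)
print(type({num for num in range(3)}))   # a set, not a dict
```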

#### <u> Generator </u>

In [23]:
gen_res = (num for num in range(5))

In [24]:
for i in gen_res:
    print(i)

0
1
2
3
4


In [25]:
gen_res1 = (num for num in range(3))

In [26]:
print(list(gen_res1))

[0, 1, 2]


In [27]:
evens = (num for num in range(11) if num % 2 == 0)

In [28]:
print(list(evens))

[0, 2, 4, 6, 8, 10]


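<b> Why generator when you have list and dict comprehensions? </b>

A generator produces its values lazily, one at a time, instead of building the whole collection in memory.
If you are iterating only once, a generator is usually the best option.
If you want to iterate multiple times, use a list comprehension, because a generator is exhausted after the first pass.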
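
Both points can be seen in a short sketch. The exact sizes reported by `sys.getsizeof` depend on the Python build, so only the comparison is printed; the variable names are illustrative.

```python
import sys

squares_list = [num * num for num in range(1000)]
squares_gen = (num * num for num in range(1000))

# The list stores all 1000 values; the generator stores only its current state
list_is_bigger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)

# A generator can be consumed only once
first_pass = sum(squares_gen)
second_pass = sum(squares_gen)      # nothing left to yield, so the sum is 0

# A list can be iterated as often as needed
list_first = sum(squares_list)
list_second = sum(squares_list)

print(list_is_bigger)
print(first_pass, second_pass)
print(list_first == list_second)
```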