# Python Toolset for Data Science

## Section I - Datatypes

### Properties: 

- Arrays are homogeneous, mutable containers that allow duplicate values. <br>
- Lists are similar to arrays, except that they support heterogeneous datatypes.<br>
- Tuples have the same properties as lists, but they are immutable.<br>
- Dictionaries are mutable collections of key-value pairs; keys must be unique, though values may repeat.

### When to use

- Arrays - When you need fast random access and all elements share one datatype. <br>
- Lists - When you need an ordered, iterable collection, possibly of mixed types, that is often modified.<br>
- Tuples - When the data must not change. <br>
- Dictionaries - When values are logically associated with keys (key-value lookup) and the data is frequently modified.<br>
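The properties listed above can be checked directly. The following is a minimal sketch, assuming "array" refers to a NumPy array; the variable names are illustrative only.

```python
import numpy as np

# Lists hold mixed types and can be modified in place.
mixed = [1, "two", 3.0]
mixed.append(4)

# Tuples reject modification: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# A NumPy array stores a single dtype; mixed numeric input is coerced
# to a common type (here the ints become floats).
arr = np.array([1, 2, 3.5])

# Dictionary keys are unique: assigning to an existing key overwrites its value.
codes = {"BR": "Brazil", "RU": "Russia"}
codes["BR"] = "Brasil"

print(mixed, tuple_changed, arr.dtype, codes)
```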

![title](img/Python Datatypess.png)

Importing required modules

In [60]:
import pandas as pd
import numpy as np

In [61]:
brics_dict = {
    "country":["Brazil", "Russia", "India", "China", "South Africa"],
    "code" : ["BR", "RU", "IN", "CH", "SA"],
    "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area":[8.516, 17.10, 3.286, 9.597, 1.221],
    "population":[200.4, 143.5, 1252, 1357, 52.98] 
}

Convert a dictionary to a Data Frame

In [62]:
brics_df = pd.DataFrame(brics_dict)

In [63]:
brics_df

Unnamed: 0,area,capital,code,country,population
0,8.516,Brasilia,BR,Brazil,200.4
1,17.1,Moscow,RU,Russia,143.5
2,3.286,New Delhi,IN,India,1252.0
3,9.597,Beijing,CH,China,1357.0
4,1.221,Pretoria,SA,South Africa,52.98


## Section II - Index 

An index labels the rows of a dataframe, so that rows can be looked up by label instead of by position.

In [64]:
brics_df.set_index('code')

Unnamed: 0_level_0,area,capital,country,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BR,8.516,Brasilia,Brazil,200.4
RU,17.1,Moscow,Russia,143.5
IN,3.286,New Delhi,India,1252.0
CH,9.597,Beijing,China,1357.0
SA,1.221,Pretoria,South Africa,52.98


In [65]:
brics_df.index.values

array([0, 1, 2, 3, 4], dtype=int64)

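Note that `set_index()` returns a new dataframe, which is why `brics_df` still shows the default integer index above. The index is not a regular column, so you cannot delete it the way you would delete a column. You can only reset it using `reset_index()`, which moves the current index into a column and restores the default integer index.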

In [66]:
brics_df.reset_index()

Unnamed: 0,index,area,capital,code,country,population
0,0,8.516,Brasilia,BR,Brazil,200.4
1,1,17.1,Moscow,RU,Russia,143.5
2,2,3.286,New Delhi,IN,India,1252.0
3,3,9.597,Beijing,CH,China,1357.0
4,4,1.221,Pretoria,SA,South Africa,52.98
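As a small illustration of the index behaviour described above, the sketch below uses a hypothetical three-row stand-in for the BRICS data. It shows that `set_index` returns a new dataframe, and that `reset_index(drop=True)` discards the index instead of keeping it as a column.

```python
import pandas as pd

# Small stand-in frame (hypothetical subset of the BRICS data).
df = pd.DataFrame({
    "code": ["BR", "RU", "IN"],
    "country": ["Brazil", "Russia", "India"],
})

# set_index returns a new DataFrame; df itself keeps its default integer index.
indexed = df.set_index("code")
print(list(df.index))        # [0, 1, 2]
print(list(indexed.index))   # ['BR', 'RU', 'IN']

# reset_index moves the index back into a regular column...
restored = indexed.reset_index()
print(list(restored.columns))   # ['code', 'country']

# ...while drop=True discards the index labels instead of keeping them.
dropped = indexed.reset_index(drop=True)
print(list(dropped.columns))    # ['country']
```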


In [67]:
brics = pd.read_csv('data/brics.csv')
brics = brics.set_index('code')
brics

Unnamed: 0_level_0,country,capital,area,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


## Section III - Selecting Data

1. Square Brackets 
1. Advanced Methods (loc, iloc)

### Square Brackets

#### Column Access - Square Brackets

Passing the name of a column in square brackets returns that column of data.

In [68]:
brics["country"]

code
BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object

This column is of a `Series` datatype

In [69]:
type(brics["country"])

pandas.core.series.Series

To get a dataframe instead, enclose the column name in double square brackets, which passes a list of column names.

In [70]:
brics[["country"]]

Unnamed: 0_level_0,country
code,Unnamed: 1_level_1
BR,Brazil
RU,Russia
IN,India
CH,China
SA,South Africa


Now you see that the type is a dataframe instead of a Series

In [71]:
type(brics[["country"]])

pandas.core.frame.DataFrame

#### Row Access - Square Brackets

You can access rows of data by passing the row positions as a slice.

In [72]:
brics[1:4]

Unnamed: 0_level_0,country,capital,area,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0


You can also access a row and column data at once

#### Row and Column Access - Square Brackets

In [73]:
brics[1:4]['country']

code
RU    Russia
IN     India
CH     China
Name: country, dtype: object

### Advanced Methods

We can also select data using the `loc` and `iloc` indexers of pandas. `loc` selects by label, while `iloc` selects by integer position.

#### Row Access - loc()

We pass the index values to loc to select data from specific rows

In [74]:
brics.loc[["RU", "IN", "CH"]]

Unnamed: 0_level_0,country,capital,area,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0


Please note the double square brackets: the row labels are passed as a list, so the result is a dataframe.
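The effect of the extra pair of brackets can be seen by comparing the two forms. This is a minimal sketch on a hypothetical three-row stand-in for `brics`: a single label returns a `Series`, while a list of labels returns a dataframe, even when the list has only one entry.

```python
import pandas as pd

# Hypothetical stand-in for the brics dataframe, indexed by country code.
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India"],
     "capital": ["Brasilia", "Moscow", "New Delhi"]},
    index=["BR", "RU", "IN"],
)

row_series = brics.loc["RU"]     # single label -> Series
row_frame = brics.loc[["RU"]]    # list of labels -> DataFrame

print(type(row_series).__name__)  # Series
print(type(row_frame).__name__)   # DataFrame
```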

#### Column Access - loc()

To select columns using loc, pass `:` as the row selector (meaning all rows), followed by the column names.

In [75]:
brics.loc[:, ["country"]]

Unnamed: 0_level_0,country
code,Unnamed: 1_level_1
BR,Brazil
RU,Russia
IN,India
CH,China
SA,South Africa


#### Row and Column Access - loc()

You can also access data by passing both the rows and columns of interest  

In [76]:
brics.loc[["RU", "IN", "CH"], ["country"]]

Unnamed: 0_level_0,country
code,Unnamed: 1_level_1
RU,Russia
IN,India
CH,China


#### loc() vs square brackets []

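Both appear to do the same job, so why have two different ways to select data? Unlike plain square brackets, loc:
1. allows slicing of columns by label: df.loc[ : , 'col1':'coln']
1. allows selecting a single row by label: df.loc[5]
1. modifies the original dataframe on assignment, because selection and assignment happen in a single indexing call

      df[1:3]['col1'] = 5      # chained indexing: typically raises a SettingWithCopyWarning and may not update df
      
      df.loc[1:3, 'col1'] = 5  # changes the values in the dataframe (label slice, end label included)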
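The assignment and column-slicing cases can be sketched on a small hypothetical dataframe (the column names `col1` and `col2` are placeholders). Because the index here is the default integer index, the label slice `1:3` covers rows 1, 2 and 3.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30, 40], "col2": [1, 2, 3, 4]})

# One indexing call: rows labelled 1 through 3 (inclusive) of col1
# are set on df itself.
df.loc[1:3, "col1"] = 5
print(df["col1"].tolist())   # [10, 5, 5, 5]

# Column slicing by label, which plain square brackets do not offer.
subset = df.loc[:, "col1":"col2"]
print(list(subset.columns))  # ['col1', 'col2']
```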

#### Row Access - iloc()

In [77]:
brics.iloc[[1]]

Unnamed: 0_level_0,country,capital,area,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RU,Russia,Moscow,17.1,143.5


#### Column Access - iloc()

In [78]:
brics.iloc[:,[1]]

Unnamed: 0_level_0,capital
code,Unnamed: 1_level_1
BR,Brasilia
RU,Moscow
IN,New Delhi
CH,Beijing
SA,Pretoria


#### Row and Column Access - iloc()

In [79]:
brics.iloc[[1,3],[1,3]]

Unnamed: 0_level_0,capital,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1
RU,Moscow,143.5
CH,Beijing,1357.0


#### Slicing using iloc

In [80]:
brics.iloc[:2,:2]

Unnamed: 0_level_0,country,capital
code,Unnamed: 1_level_1,Unnamed: 2_level_1
BR,Brazil,Brasilia
RU,Russia,Moscow


#### Selecting data using comparison operators

In [81]:
brics['population'] > 50

code
BR    True
RU    True
IN    True
CH    True
SA    True
Name: population, dtype: bool

In [82]:
brics[brics['population'] > 100]

Unnamed: 0_level_0,country,capital,area,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0


In [83]:
brics[(brics['population'] > 100) & (brics['area'] < 10)]

Unnamed: 0_level_0,country,capital,area,population
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BR,Brazil,Brasilia,8.516,200.4
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
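Boolean masks can also be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below rebuilds the numeric columns from the dictionary defined earlier.

```python
import pandas as pd

brics = pd.DataFrame(
    {"area": [8.516, 17.10, 3.286, 9.597, 1.221],
     "population": [200.4, 143.5, 1252, 1357, 52.98]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Rows where either condition holds.
large_or_populous = brics[(brics["area"] > 10) | (brics["population"] > 1000)]
print(list(large_or_populous.index))   # ['RU', 'IN', 'CH']

# Rows where the condition does not hold.
not_populous = brics[~(brics["population"] > 100)]
print(list(not_populous.index))        # ['SA']
```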


## Section IV - Functions

In [84]:
def sum1(num1, num2):
    """Adds the two given numbers"""
    total = num1 + num2
    return total

sum1(2,4)

6

In [85]:
def sum2(*args):
    """Adds any number of given numbers"""
    total = 0
    for num in args:
        total += num
    return total

sum2(2,4,6,8)

20

In [86]:
def sum4(*args):
    return sum(args)

sum4(2, 4, 6, 8)

20
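A function that takes `*args` can also be called with an existing list by unpacking it with `*` at the call site. A short sketch (the name `sum_all` is hypothetical, chosen to avoid clashing with the functions above):

```python
def sum_all(*args):
    """Add any number of values; returns 0 when called with none."""
    return sum(args)

values = [2, 4, 6, 8]

# The * at the call site unpacks the list into separate positional arguments.
print(sum_all(*values))   # 20
print(sum_all())          # 0
```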

## Section V - Lambda Functions

A lambda is a compact way of writing a small, anonymous function as a single expression.

General Function: def add(a, b): return a + b

Lambda Function: lambda a, b: a + b

In [87]:
sum3 = lambda num1, num2 : num1 + num2

In [88]:
sum3(10, 20)

30

*Why lambda function?*

Python supports a style of programming called functional programming, in which functions are passed as arguments to other functions. A lambda lets you define such a function inline, at the point where it is needed.
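Two built-ins that accept a function as an argument are `sorted` (through its `key` parameter) and `map`. The following sketch passes lambdas to both, using the country names from the BRICS data.

```python
countries = ["Brazil", "Russia", "India", "China", "South Africa"]

# key= receives a function; here a lambda that returns the name's length.
# sorted is stable, so names of equal length keep their original order.
by_length = sorted(countries, key=lambda name: len(name))
print(by_length)   # ['India', 'China', 'Brazil', 'Russia', 'South Africa']

# map applies the lambda to every element.
upper = list(map(lambda name: name.upper(), countries))
print(upper[0])    # BRAZIL
```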

#### apply() 

In [89]:
brics['length'] = brics['country'].apply(len)

In [90]:
brics

Unnamed: 0_level_0,country,capital,area,population,length
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
BR,Brazil,Brasilia,8.516,200.4,6
RU,Russia,Moscow,17.1,143.5,6
IN,India,New Delhi,3.286,1252.0,5
CH,China,Beijing,9.597,1357.0,5
SA,South Africa,Pretoria,1.221,52.98,12


#### apply and lambda Functions

In [91]:
brics['pop_million'] = brics['population'].apply(lambda x: x * 100)

In [92]:
brics

Unnamed: 0_level_0,country,capital,area,population,length,pop_million
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BR,Brazil,Brasilia,8.516,200.4,6,20040.0
RU,Russia,Moscow,17.1,143.5,6,14350.0
IN,India,New Delhi,3.286,1252.0,5,125200.0
CH,China,Beijing,9.597,1357.0,5,135700.0
SA,South Africa,Pretoria,1.221,52.98,12,5298.0
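For simple arithmetic such as this, the same result can be obtained without `apply`, by operating on the whole column at once. The sketch below compares the two on the population values; `apply` remains useful when the function cannot be expressed as a column-wise operation.

```python
import pandas as pd

population = pd.Series(
    [200.4, 143.5, 1252, 1357, 52.98],
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Element by element, through a Python-level function call per value.
via_apply = population.apply(lambda x: x * 100)

# Whole-column (vectorized) arithmetic.
vectorized = population * 100

print(via_apply.equals(vectorized))
```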


In [93]:
# brics = brics.drop(['pop_million'], axis=1)

## Section VI - Comprehensions

1. List Comprehensions
1. Set Comprehensions
1. Generator Expressions

#### List Comprehensions

A concise way of creating a list from an iterable in a single expression.

In [94]:
res = [num % 3 for num in range(11)]

In [95]:
res

[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]

Create new list from existing

In [96]:
new_res = [num % 3 + 1 for num in res]

In [97]:
new_res

[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]

Advanced Comprehensions

In [98]:
newer_res = [num % 3 for num in res if num % 2 == 0]

In [99]:
newer_res

[0, 2, 0, 2, 0, 2, 0]
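A filtered comprehension is shorthand for a loop with an `if` inside it. The sketch below writes the same computation both ways (the name `loop_res` is introduced here for illustration).

```python
res = [num % 3 for num in range(11)]

# Comprehension form: keep even values, then take each modulo 3.
newer_res = [num % 3 for num in res if num % 2 == 0]

# Equivalent explicit loop.
loop_res = []
for num in res:
    if num % 2 == 0:
        loop_res.append(num % 3)

print(newer_res == loop_res)   # True
```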

#### Set Comprehensions

Use {} instead of []

In [100]:
res = {num % 3 for num in range(11)}

In [101]:
res

{0, 1, 2}

In [102]:
new_res = {num % 3 + 1 for num in res}

In [103]:
new_res

{1, 2, 3}

In [104]:
newer_res = {num % 3 for num in res if num % 2 == 0}

In [105]:
newer_res

{0, 2}
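The same braces also build dictionaries when each element is written as a `key: value` pair. A minimal sketch of a dict comprehension, using the country names from the BRICS data:

```python
countries = ["Brazil", "Russia", "India", "China", "South Africa"]

# key: value inside braces produces a dict instead of a set.
name_lengths = {name: len(name) for name in countries}
print(name_lengths["South Africa"])   # 12
```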

#### Generator

In [126]:
gen_res = (num for num in range(5))

In [127]:
print(gen_res)

<generator object <genexpr> at 0x00000000048617D8>


To print the elements in the generator, we have to iterate over the generator object

In [128]:
for i in gen_res:
    print(i)

0
1
2
3
4


In [108]:
gen_res1 = (num for num in range(3))

In [109]:
print(list(gen_res1))

[0, 1, 2]


In [110]:
evens = (num for num in range(11) if num % 2 ==0)

In [111]:
print(list(evens))

[0, 2, 4, 6, 8, 10]
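One property to be aware of: a generator can be consumed only once. After it has been iterated to the end, it yields nothing further, as this short sketch shows.

```python
evens = (num for num in range(11) if num % 2 == 0)

first_pass = list(evens)    # consumes the generator
second_pass = list(evens)   # nothing left to yield

print(first_pass)    # [0, 2, 4, 6, 8, 10]
print(second_pass)   # []
```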


*Why use a generator when you have list and set comprehensions?*

The list comprehension will create the entire list in memory first while the generator expression creates the items on the fly. Hence, the generator expression uses less memory since it doesn't build the whole list at once. 
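The memory difference can be illustrated with `sys.getsizeof`. This is a rough sketch: `getsizeof` reports only the size of the container object itself (not the integers it refers to), and exact byte counts vary by Python version, so only the comparison is printed.

```python
import sys

list_com = [i for i in range(100_000)]
gen_exp = (i for i in range(100_000))

# The list holds references to all 100,000 items; the generator object
# stores only its current state, regardless of the range length.
print(sys.getsizeof(list_com) > sys.getsizeof(gen_exp))   # True
```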

Let's see how their execution times compare.

In [135]:
# Measures execution time of small code snippets for given number of executions
import timeit

In [136]:
print(timeit.timeit('''list_com = [i for i in range(100) if i % 2 == 0]''', number=1000000))

8.008388165874806


In [137]:
print(timeit.timeit('''gen_exp = (i for i in range(100) if i % 2 == 0)''', number=1000000))

0.7537322206339923


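There is a remarkable difference in the execution time, but the comparison is not like for like: the second statement only creates the generator object and never iterates over it, so none of its elements are actually computed. When every element is consumed, a generator is generally not faster than a list comprehension. Its main advantage is lower memory use.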
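A fairer timing has both variants do the same work, for example by summing every element. The sketch below is one way to set that up; the statements and the repetition count are assumptions chosen for illustration, and the measured times depend on the machine and Python version, so no expected output is shown.

```python
import timeit

# Both statements iterate over all elements, so the generator is fully consumed.
list_time = timeit.timeit(
    "sum([i for i in range(100) if i % 2 == 0])", number=10_000
)
gen_time = timeit.timeit(
    "sum(i for i in range(100) if i % 2 == 0)", number=10_000
)

print(f"list comprehension:   {list_time:.4f} s")
print(f"generator expression: {gen_time:.4f} s")
```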