### FUNDAMENTAL PYTHON NOTES: Dataframes
#### Created by Ugur URESIN, AI Engineer, Data Scientist

#### LIBRARY IMPORT

In [7]:
import os
import math
import numpy as np
import pandas as pd

### DataFrame OPERATIONS

#### WORKING DIRECTORY OPERATIONS

In [5]:
## SHOW CURRENT WD
os.getcwd()

'/Users/UGUR/Desktop/mycodes/mynotebooks/python_codes'

In [25]:
## CHANGE THE WD
#path="/Users/UGUR/Desktop/mycodes"
#os.chdir(path)  #to set the path as wd

#### DATA IMPORT AND INITIAL CHECK

In [15]:
df = pd.read_csv('/Users/UGUR/Desktop/mycodes/_data/house_prices.csv')

In [20]:
## SHOW '3' SAMPLE ROWS (default=1)
df.sample(3) 

| | house_id | neighborhood | area | bedrooms | bathrooms | style | price |
|---|---|---|---|---|---|---|---|
| 1045 | 6845 | A | 973 | 0 | 0 | lodge | 253387 |
| 2735 | 5057 | C | 4244 | 6 | 4 | victorian | 1065371 |
| 1033 | 2953 | B | 1665 | 3 | 2 | ranch | 833452 |


In [23]:
## SHOW 'FIRST 3' ROWS (default=5)
df.head(3) 

| | house_id | neighborhood | area | bedrooms | bathrooms | style | price |
|---|---|---|---|---|---|---|---|
| 0 | 1112 | B | 1188 | 3 | 2 | ranch | 598291 |
| 1 | 491 | B | 3512 | 5 | 3 | victorian | 1744259 |
| 2 | 5952 | B | 1134 | 3 | 2 | ranch | 571669 |


In [24]:
## SHOW 'LAST 3' ROWS (default=5)
df.tail(3) 

| | house_id | neighborhood | area | bedrooms | bathrooms | style | price |
|---|---|---|---|---|---|---|---|
| 6025 | 5894 | B | 1518 | 2 | 1 | lodge | 760829 |
| 6026 | 5591 | C | 2270 | 4 | 2 | ranch | 575515 |
| 6027 | 6211 | C | 3355 | 5 | 3 | victorian | 844747 |


In [26]:
## SHOW BASIC INFO
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6028 entries, 0 to 6027
Data columns (total 7 columns):
house_id        6028 non-null int64
neighborhood    6028 non-null object
area            6028 non-null int64
bedrooms        6028 non-null int64
bathrooms       6028 non-null int64
style           6028 non-null object
price           6028 non-null int64
dtypes: int64(5), object(2)
memory usage: 329.7+ KB


In [41]:
## SHOW DESCRIPTIVE STATISTICS
df.describe()

| | house_id | area | bedrooms | bathrooms | price |
|---|---|---|---|---|---|
| count | 6028.0 | 6028.0 | 6028.0 | 6028.0 | 6028.0 |
| mean | 4110.536828 | 2136.849038 | 3.717651 | 2.158261 | 754207.9 |
| std | 2251.834665 | 1237.481101 | 1.70465 | 1.169621 | 523673.1 |
| min | 200.0 | 0.0 | 0.0 | 0.0 | 12167.0 |
| 25% | 2167.5 | 1225.0 | 3.0 | 2.0 | 364135.0 |
| 50% | 4120.5 | 1826.0 | 4.0 | 2.0 | 635759.0 |
| 75% | 6070.25 | 3129.0 | 5.0 | 3.0 | 966675.2 |
| max | 7999.0 | 7447.0 | 8.0 | 5.0 | 3684602.0 |


In [30]:
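## SHOW CORRELATION BETWEEN NUMERIC VARIABLES
## (pandas >= 2.0 raises an error on text columns such as 'style' unless numeric_only=True)
df.corr(numeric_only=True)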

| | house_id | area | bedrooms | bathrooms | price |
|---|---|---|---|---|---|
| house_id | 1.0 | 0.005591 | 0.008959 | 0.011725 | 0.005768 |
| area | 0.005591 | 1.0 | 0.901623 | 0.891481 | 0.823454 |
| bedrooms | 0.008959 | 0.901623 | 1.0 | 0.972768 | 0.743435 |
| bathrooms | 0.011725 | 0.891481 | 0.972768 | 1.0 | 0.735851 |
| price | 0.005768 | 0.823454 | 0.743435 | 0.735851 | 1.0 |


#### DATA GROUPING EXAMPLES

In [36]:
## GROUP THE ROWS BASED on 'bedrooms' & CALCULATE THE MEANS of THE CORRESPONDING PRICES
df.groupby(['bedrooms'])['price'].mean()

bedrooms
0    2.895395e+05
1    7.448761e+04
2    4.291953e+05
3    4.624925e+05
4    7.077788e+05
5    1.141981e+06
6    1.323303e+06
7    1.600795e+06
8    1.985054e+06
Name: price, dtype: float64

In [40]:
## GROUPING on 'bedrooms' and 'bathrooms' & CALCULATE THE MEANS OF THE CORRESPONDING PRICES
df.groupby(['bedrooms','bathrooms'])['price'].mean()

bedrooms  bathrooms
0         0            2.895395e+05
1         0            7.448761e+04
2         1            4.291953e+05
3         2            4.624925e+05
4         2            7.077788e+05
5         3            1.141981e+06
6         4            1.323303e+06
7         4            1.600795e+06
8         5            1.985054e+06
Name: price, dtype: float64
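
The grouping pattern can also be tried without the CSV file. The sketch below uses a tiny hand-made frame; the name `toy` and all of its values are made up for illustration and are not taken from house_prices.csv. It also shows `agg`, which returns several statistics per group in one call.

```python
import pandas as pd

# Hypothetical toy data, NOT from house_prices.csv
toy = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 3],
    "price": [100, 300, 200, 400, 600],
})

# Mean price per bedroom count (same pattern as df.groupby(...)['price'].mean())
mean_price = toy.groupby("bedrooms")["price"].mean()

# Several statistics per group in a single call
summary = toy.groupby("bedrooms")["price"].agg(["mean", "count"])

print(mean_price)
print(summary)
```

Passing a list of function names to `agg` yields one column per statistic, which is often handier than running a separate groupby for each.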

### FUNCTIONS

#### DEFINING A FUNCTION

**Function Name:** cylinder_volume<br>
**Arguments or Parameters:** height, radius

In [3]:
def cylinder_volume (height,radius): #function header
    pi = 3.1415                      #function body
    return height*pi*radius**2       #return value

In the function above, 'pi' is a **local variable**: it exists only inside the function and cannot be read or modified from outside it!<br>
For example, if we try to print 'pi' outside the function, we get an error as follows!

In [4]:
print(pi)

NameError: name 'pi' is not defined

In [5]:
## Example: CALCULATE THE CYLINDER VOLUME WHEN height=10 and radius=5
cylinder_volume(10,5)

785.3750000000001

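**Note that:** By default, assigning to a name inside a function creates a new local variable! A function can rebind a variable from an outer scope only if it declares that name with `global` or `nonlocal`. (Mutable objects such as lists can still be changed in place.)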
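
To make the scoping rule concrete, here is a minimal sketch. The names `counter`, `items` and the three helper functions are hypothetical and exist only for this illustration. It contrasts a plain assignment (which creates a local), a `global` declaration (which rebinds the module-level name), and an in-place mutation (which needs no declaration).

```python
counter = 0   # module-level (global) variable
items = []    # module-level mutable list

def assign_local():
    counter = 99      # creates a NEW local variable; the global is untouched
    return counter

def increment_global():
    global counter    # explicitly rebind the module-level name
    counter += 1

def add_item(x):
    items.append(x)   # in-place mutation of the list, no declaration needed

assign_local()
increment_global()
add_item("a")

print(counter)  # 1
print(items)    # ['a']
```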

#### DEFAULT ARGUMENTS

Let's define the same function as above under a new name <br>
**Function Name:** cylinder_volume2<br>
**Arguments or Parameters:** height, radius<br>
<br>
Here, we would like to assign a default value to the radius!

In [6]:
def cylinder_volume2 (height,radius=5): #function header
    pi = 3.1415                         #function body
    return height*pi*radius**2          #return value

In [9]:
cylinder_volume2(10)

785.3750000000001

Even when an argument has a **default value**, the caller can still pass a different value!

In [10]:
cylinder_volume2(10,2)

125.66000000000001
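
Arguments can also be passed by name. The sketch below repeats the definition of cylinder_volume2 so that it runs on its own; the variable names `v_default`, `v_keyword` and `v_swapped` are only illustrative. It shows that keyword arguments may be given in any order.

```python
def cylinder_volume2(height, radius=5):
    pi = 3.1415
    return height * pi * radius**2

v_default = cylinder_volume2(10)                    # radius falls back to 5
v_keyword = cylinder_volume2(height=10, radius=2)   # both passed by name
v_swapped = cylinder_volume2(radius=2, height=10)   # order does not matter

print(v_default, v_keyword, v_swapped)
```

Note that in a function header, parameters with default values must come after parameters without them.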

#### DOCUMENTATION STRINGS (DOCSTRING)

A DOCSTRING is especially useful when a group of people (or a team) is working on the same project!<br>
Thanks to DOCSTRINGs, each member of the team can understand the purpose of a function created by another team member!<br>
An explanatory example is given below:

In [12]:
def myfunc1(arg1, arg2):
    """Explanation for this function is given here
    arg1 means ...
    arg2 means ..."""
    value = (arg1, arg2)  # placeholder result so that the example runs
    return value
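
Once a docstring is written, it can be read back at runtime. The sketch below uses a made-up function, `rectangle_area`, purely for illustration, and retrieves its docstring with `inspect.getdoc` from the standard library, which also strips the indentation.

```python
import inspect

def rectangle_area(width, height):
    """Return the area of a rectangle.

    width  -- length of the horizontal side
    height -- length of the vertical side
    """
    return width * height

doc = inspect.getdoc(rectangle_area)  # cleaned-up docstring text
print(doc)
```

In an interactive session, `help(rectangle_area)` shows the same text together with the function signature.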

#### LAMBDA EXPRESSIONS

Lambda expressions are anonymous functions.<br>
They do not need a name and they are mostly used as short arguments to higher-order functions, e.g. map() and filter()<br>
<br>
Let's create a function called 'double' that returns 'given argument' * 2!

In [None]:
## DOUBLE FUNCTION
def double(x):
    return x*2

In [14]:
## DOUBLE as a LAMBDA EXPRESSION
double = lambda x: x*2

The parameter name does not have to be 'x'.

In [15]:
double = lambda num: num*2

In [16]:
## MULTIPLIER FUNCTION
def multiplier(x,y):
    return x*y

In [17]:
multiplier = lambda x,y: x*y
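
A lambda bound to a name is called like any other function, and a lambda can also be passed directly as an argument. The sketch below is only an illustration; the list `words` and the result names are hypothetical.

```python
double = lambda num: num * 2
multiplier = lambda x, y: x * y

doubled = double(4)          # 8
product = multiplier(3, 5)   # 15

# A lambda used inline as the sort key (no name needed)
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=lambda w: len(w))

print(doubled, product, by_length)
```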

#### THE map FUNCTION

map() is a higher-order built-in function that takes a function and an iterable as inputs, and returns an iterator that applies the function to each element of the iterable!

In [75]:
## EXAMPLE
import math

def circle_area(radius):
    """Calculates area of a circle with given radius"""
    return math.pi*(radius**2)

Here, we would like to calculate the areas for the 10 given radius values!

In [76]:
radii = [1, 1.2, 2, 3, 3.6, 4, 4.8, 5, 5.2, 6]

In [77]:
## METHOD-1: ITERATING WITH A FOR LOOP
circle_areas = []

for r in radii:
    area = circle_area(r)
    circle_areas.append(area)

circle_areas

[3.141592653589793,
 4.523893421169302,
 12.566370614359172,
 28.274333882308138,
 40.71504079052372,
 50.26548245743669,
 72.38229473870884,
 78.53981633974483,
 84.94866535306801,
 113.09733552923255]

In [78]:
## METHOD-2: map() FUNCTION
## The format is map(function,iterable)
map(circle_area, radii)

<map at 0x113fe5940>

The output is not a list, it's a map object (a lazy iterator)! Wrapping it in list() evaluates it.

In [79]:
list(map(circle_area, radii))

[3.141592653589793,
 4.523893421169302,
 12.566370614359172,
 28.274333882308138,
 40.71504079052372,
 50.26548245743669,
 72.38229473870884,
 78.53981633974483,
 84.94866535306801,
 113.09733552923255]

In [104]:
## EXAMPLE-2
numbers = [[2,4,6],
           [6,6,9],
           [7,8,9]]

list(map(lambda x: sum(x)/len(x), numbers))

[4.0, 7.0, 8.0]
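
Two related patterns may be worth knowing. The sketch below uses shortened, made-up input lists (`radii`, `heights`) for illustration. It shows that a list comprehension gives the same result as `list(map(...))`, and that map() accepts several iterables when the function takes several arguments.

```python
import math

radii = [1, 2, 3]

# map() with a lambda vs. the equivalent list comprehension
areas_map = list(map(lambda r: math.pi * r**2, radii))
areas_comp = [math.pi * r**2 for r in radii]

# map() with two iterables: elements are paired up position by position
heights = [10, 20, 30]
volumes = list(map(lambda r, h: math.pi * r**2 * h, radii, heights))

print(areas_map == areas_comp)
print(volumes)
```

With several iterables, map() stops as soon as the shortest one is exhausted.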

#### LAMBDA EXPRESSIONS WITH FILTER

filter() is a higher-order built-in function that takes a function and an iterable as inputs, and returns an iterator with the elements from the iterable for which the function returns True!

In [106]:
## EXAMPLE
city_list = ['New York City', 'Berlin', 'Paris', 'Istanbul', 'Los Angeles', 'Mexico City', 'Guangzhou']

def is_short(name):
    return len(name)<10

short_cities = list(filter(is_short, city_list))
print(short_cities)

['Berlin', 'Paris', 'Istanbul', 'Guangzhou']
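
The same filtering can be written with a lambda expression instead of the named function is_short. The sketch below repeats city_list so that it runs on its own; `short_cities_comp` is an illustrative name for the equivalent list comprehension.

```python
city_list = ['New York City', 'Berlin', 'Paris', 'Istanbul',
             'Los Angeles', 'Mexico City', 'Guangzhou']

# filter() with an inline lambda instead of a separately defined function
short_cities = list(filter(lambda name: len(name) < 10, city_list))

# Equivalent list comprehension
short_cities_comp = [name for name in city_list if len(name) < 10]

print(short_cities)
```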
