<p align="center">
<img align="center" width="600" src="../imgs/logo.png">
<h3 align="center">Introduction to Data Science</h3>
<h4 align="center">Chapter 2: Pandas Apply</h4>
<h5 align="center">Yam Peleg</h5>
</p>
<hr>

Many of the slides and notebooks in this repository are based on other repositories and tutorials. 

**References for this notebook:**  

* **[Daniel Chen - Pandas For Everyone](https://github.com/chendaniely/pandas_for_everyone)**
* **[Guilherme Samora - Pandas Exercises](https://github.com/guipsamora/pandas_exercises)**
<hr>

### The Data

![](https://www.reno.gov/Home/ShowImage?id=7739&t=635620964226970000)

**Competition Description from Kaggle**  
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

**Data description**  
This is a detailed description of the 79 features and their entries, quite important for this competition.  
You can download the txt file here: [**download**](https://www.kaggle.com/c/5407/download/data_description.txt)

**References**  

* **[Kaggle: Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)**
* **[Udemy: Python for Data Science and Machine Learning Bootcamp](https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/)**
* **[Data School: Machine learning in Python with scikit-learn](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A)**



### Pandas Apply:

Again, load the house prices data.

### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

In [2]:
def my_function():
    pass

In [3]:
def my_sq(x):
    return x ** 2

In [4]:
my_sq(4)

16

In [5]:
assert my_sq(4) == 16

In [6]:
def avg_2(x, y):
    return (x + y) / 2

In [7]:
avg_2(10, 20)

15.0

In [8]:
import pandas as pd

In [9]:
df = pd.DataFrame({
    'a': [10, 20, 30],
    'b': [20, 30, 40]
})

In [10]:
df['a'] ** 2

0    100
1    400
2    900
Name: a, dtype: int64

In [11]:
df['a'].apply(my_sq)

0    100
1    400
2    900
Name: a, dtype: int64
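
The two cells above give the same result because `Series.apply` calls the function once per element, whereas `df['a'] ** 2` operates on the whole column at once. The sketch below rebuilds the same toy DataFrame and spells out the per-element loop by hand; the names `looped`, `applied`, and `vectorized` are introduced here for illustration only.

```python
import pandas as pd


def my_sq(x):
    return x ** 2


df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

# Hand-written equivalent of what apply does with a plain Python function:
# call my_sq on each value and collect the results under the same index.
looped = pd.Series([my_sq(value) for value in df['a']],
                   index=df['a'].index, name='a')

# The same computation through apply and through a vectorized operator.
applied = df['a'].apply(my_sq)
vectorized = df['a'] ** 2

print(applied.tolist())
print(vectorized.tolist())
```

When a vectorized operator exists, it is usually the simpler and faster choice; `apply` is most useful when the logic cannot be expressed that way.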

In [12]:
def my_exp(x, e):
    return x ** e

In [13]:
my_exp(4, 2)

16

In [14]:
my_exp(4, 3)

64

In [15]:
df['a'].apply(my_exp, e=4)

0     10000
1    160000
2    810000
Name: a, dtype: int64
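
Extra arguments for the applied function can be passed by keyword, as above, or positionally through the `args` tuple of `Series.apply`. A minimal sketch, using a standalone Series `s` that stands in for `df['a']`:

```python
import pandas as pd


def my_exp(x, e):
    return x ** e


s = pd.Series([10, 20, 30], name='a')

# Keyword form: e=4 is forwarded to my_exp on every call.
by_keyword = s.apply(my_exp, e=4)

# Positional form: the tuple is unpacked after the element itself,
# so each call is my_exp(element, 4).
by_position = s.apply(my_exp, args=(4,))

print(by_keyword.tolist())
print(by_position.tolist())
```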

In [16]:
def print_me(x):
    print(x)

In [17]:
df.apply(print_me)

0    10
1    20
2    30
Name: a, dtype: int64
0    20
1    30
2    40
Name: b, dtype: int64


a    None
b    None
dtype: object
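
The printout shows that `DataFrame.apply` hands the function a whole column (a Series) at a time, not a single value. Passing `axis=1` hands it one row at a time instead. A short sketch on the same toy data; the helper `total` is a name made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})


def total(values):
    # 'values' is a Series: a column when axis=0, a row when axis=1.
    return values.sum()


col_sums = df.apply(total)          # default axis=0: one call per column
row_sums = df.apply(total, axis=1)  # axis=1: one call per row

print(col_sums)
print(row_sums)
```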

In [18]:
def avg_3(x, y, z):
    return (x + y + z) / 3

In [19]:
import numpy as np

In [20]:
def avg_3_apply(col):
    return np.mean(col)

In [21]:
df.apply(avg_3_apply)

a    20.0
b    30.0
dtype: float64

In [22]:
def avg_3_apply(col):
    x = col[0]
    y = col[1]
    z = col[2]
    return (x + y + z) / 3

In [23]:
df.apply(avg_3_apply)

a    20.0
b    30.0
dtype: float64
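
Indexing with `col[0]` works here because the default index labels happen to be `0`, `1`, `2`. If the index carries other labels, positional access through `.iloc` is the safer way to say "first, second, third value". The sketch below assumes a hypothetical string index `['x', 'y', 'z']` to make the difference visible:

```python
import pandas as pd

# Same values as before, but with string labels instead of 0, 1, 2.
df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]},
                  index=['x', 'y', 'z'])


def avg_3_by_position(col):
    # .iloc selects by position, regardless of the index labels.
    x = col.iloc[0]
    y = col.iloc[1]
    z = col.iloc[2]
    return (x + y + z) / 3


result = df.apply(avg_3_by_position)
print(result)
```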

In [24]:
df['a'].mean()

20.0

In [25]:
df['a'] + df['b']

0    30
1    50
2    70
dtype: int64

In [26]:
def avg_2_mod(x, y):
    if x == 20:
        return np.nan
    else:
        return (x + y) / 2

In [27]:
avg_2(df['a'], df['b'])

0    15.0
1    25.0
2    35.0
dtype: float64
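
`avg_2` accepts whole columns because `+` and `/` are vectorized. `avg_2_mod` is different: its `if x == 20` test expects a single value, so passing two Series raises a `ValueError`. One possible workaround, sketched below, is to wrap the function with `np.vectorize`; another is to express the condition with `np.where`. The names `avg_2_mod_vec`, `vectorized`, and `with_where` are introduced only for this example.

```python
import numpy as np
import pandas as pd


def avg_2_mod(x, y):
    if x == 20:
        return np.nan
    return (x + y) / 2


df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

# Calling the scalar function on whole columns fails, because the
# truth value of a Series comparison is ambiguous.
try:
    avg_2_mod(df['a'], df['b'])
    raised = False
except ValueError:
    raised = True

# Option 1: let NumPy call the scalar function element by element.
avg_2_mod_vec = np.vectorize(avg_2_mod)
vectorized = avg_2_mod_vec(df['a'].to_numpy(), df['b'].to_numpy())

# Option 2: express the condition itself in vectorized form.
with_where = np.where(df['a'] == 20, np.nan, (df['a'] + df['b']) / 2)

print(raised)
print(vectorized)
print(with_where)
```

`np.vectorize` is a convenience wrapper around a Python loop, so the `np.where` form is generally the better fit for large columns.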

# Ex 2: Pandas ".apply"


### Step 1. Import the necessary libraries

In [28]:
import pandas as pd
import numpy

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/ypeleg/data_hack_haifa/master/data/house_prices.csv) (a copy is also available in `../data`).

### Step 3. Assign it to a variable called df.

In [29]:
path = '../data/house_prices.csv'
# path = 'https://raw.githubusercontent.com/ypeleg/data_hack_haifa/master/data/house_prices.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


### Step 4. Create a function that lowercases strings. (Hint: in Python, every str object has the method `.lower()`.)

In [30]:
def lowerer(x):
    return x.lower()

### Step 5. Lowercase both SaleCondition and LotShape

In [31]:
df['SaleCondition'].apply(lowerer)
df['LotShape'].apply(lowerer)

0       reg
1       reg
2       ir1
3       ir1
4       ir1
5       ir1
6       reg
7       ir1
8       reg
9       reg
10      reg
11      ir1
12      ir2
13      ir1
14      ir1
15      reg
16      ir1
17      reg
18      reg
19      reg
20      ir1
21      reg
22      reg
23      reg
24      ir1
25      reg
26      reg
27      reg
28      ir1
29      ir1
       ... 
1430    ir3
1431    ir1
1432    reg
1433    ir1
1434    reg
1435    reg
1436    reg
1437    reg
1438    reg
1439    reg
1440    ir1
1441    reg
1442    reg
1443    reg
1444    reg
1445    reg
1446    ir1
1447    reg
1448    reg
1449    reg
1450    reg
1451    reg
1452    reg
1453    reg
1454    reg
1455    reg
1456    reg
1457    reg
1458    reg
1459    reg
Name: LotShape, Length: 1460, dtype: object
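
For string columns, pandas also offers the vectorized `.str` accessor, which can replace a hand-written function such as `lowerer`. One practical difference is missing values: `x.lower()` fails on `NaN`, while `.str.lower()` passes it through. The sketch below uses a small made-up Series (not the real column) to show both behaviours:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value, for illustration only.
shapes = pd.Series(['Reg', 'IR1', np.nan], name='LotShape')

# Vectorized string method: missing values stay missing.
lowered = shapes.str.lower()

# The apply version breaks on NaN, because a float has no .lower().
try:
    shapes.apply(lambda x: x.lower())
    apply_failed = False
except AttributeError:
    apply_failed = True

print(lowered)
print(apply_failed)
```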

### Step 7. Print the last rows of the dataset.

In [32]:
df.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125
1459,1460,20,RL,75.0,9937,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2008,WD,Normal,147500


### Step 8. Did you notice that the original DataFrame still has the original capitalization? Why is that? Fix it and lowercase both SaleCondition and LotShape.

In [33]:
df['SaleCondition'] = df['SaleCondition'].apply(lowerer)
df['LotShape'] = df['LotShape'].apply(lowerer)
df.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1455,1456,60,RL,62.0,7917,Pave,,reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,normal,175000
1456,1457,20,RL,85.0,13175,Pave,,reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,normal,210000
1457,1458,70,RL,66.0,9042,Pave,,reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,normal,266500
1458,1459,20,RL,68.0,9717,Pave,,reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,normal,142125
1459,1460,20,RL,75.0,9937,Pave,,reg,Lvl,AllPub,...,0,,,,0,6,2008,WD,normal,147500
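
The reason is that `apply` returns a new Series and leaves the DataFrame untouched; the change only sticks once the result is assigned back to the column. A minimal sketch on a two-row stand-in DataFrame (the values are taken from the LotShape column, the variable names are made up):

```python
import pandas as pd

df = pd.DataFrame({'LotShape': ['Reg', 'IR1']})

# apply builds a new Series; df itself is not modified.
result = df['LotShape'].apply(str.lower)
before_assignment = df['LotShape'].tolist()

# Assigning the result back replaces the column in df.
df['LotShape'] = result
after_assignment = df['LotShape'].tolist()

print(before_assignment)
print(after_assignment)
```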


### Step 9. Create a function called is_rich_dude that returns a boolean value, and store the result in a new column called rich_dude. The value should be True if the house (in your opinion) belongs to a rich dude.

In [34]:
def is_rich_dude(x):
    # The comparison already yields a boolean, so it can be returned directly.
    return x > 17

In [35]:
df['rich_dude'] = df['SalePrice'].apply(is_rich_dude)
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,rich_dude
0,1,60,RL,65.0,8450,Pave,,reg,Lvl,AllPub,...,,,,0,2,2008,WD,normal,208500,True
1,2,20,RL,80.0,9600,Pave,,reg,Lvl,AllPub,...,,,,0,5,2007,WD,normal,181500,True
2,3,60,RL,68.0,11250,Pave,,ir1,Lvl,AllPub,...,,,,0,9,2008,WD,normal,223500,True
3,4,70,RL,60.0,9550,Pave,,ir1,Lvl,AllPub,...,,,,0,2,2006,WD,abnorml,140000,True
4,5,60,RL,84.0,14260,Pave,,ir1,Lvl,AllPub,...,,,,0,12,2008,WD,normal,250000,True
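
Every row shown above comes out as `True`, since each of these sale prices is far above 17. A comparison like this can also be written without `apply`, because `>` works on a whole column. The sketch below uses the five sale prices from the output above and a hypothetical cutoff of 200,000, chosen purely for illustration:

```python
import pandas as pd

# The five SalePrice values visible in the head() output above.
prices = pd.Series([208500, 181500, 223500, 140000, 250000], name='SalePrice')

# Hypothetical cutoff; pick whatever matches your own opinion of "rich".
RICH_THRESHOLD = 200_000

# Vectorized comparison: one boolean per row, no apply needed.
rich_dude = prices > RICH_THRESHOLD

# Equivalent element-by-element version through apply.
rich_dude_apply = prices.apply(lambda price: price > RICH_THRESHOLD)

print(rich_dude.tolist())
```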


### Step 10. Multiply every number in the dataset by 10.
##### I know this makes no sense; it is just an exercise.

In [36]:
def times10(x):
    # Only values of type int are multiplied; floats, strings, and booleans are returned unchanged.
    if type(x) is int:
        return 10 * x
    return x

In [37]:
df.applymap(times10).head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,rich_dude
0,10,600,RL,65.0,84500,Pave,,reg,Lvl,AllPub,...,,,,0,20,20080,WD,normal,2085000,True
1,20,200,RL,80.0,96000,Pave,,reg,Lvl,AllPub,...,,,,0,50,20070,WD,normal,1815000,True
2,30,600,RL,68.0,112500,Pave,,ir1,Lvl,AllPub,...,,,,0,90,20080,WD,normal,2235000,True
3,40,700,RL,60.0,95500,Pave,,ir1,Lvl,AllPub,...,,,,0,20,20060,WD,abnorml,1400000,True
4,50,600,RL,84.0,142600,Pave,,ir1,Lvl,AllPub,...,,,,0,120,20080,WD,normal,2500000,True
5,60,500,RL,85.0,141150,Pave,,ir1,Lvl,AllPub,...,,MnPrv,Shed,7000,100,20090,WD,normal,1430000,True
6,70,200,RL,75.0,100840,Pave,,reg,Lvl,AllPub,...,,,,0,80,20070,WD,normal,3070000,True
7,80,600,RL,,103820,Pave,,ir1,Lvl,AllPub,...,,,Shed,3500,110,20090,WD,normal,2000000,True
8,90,500,RM,51.0,61200,Pave,,reg,Lvl,AllPub,...,,,,0,40,20080,WD,abnorml,1299000,True
9,100,1900,RL,50.0,74200,Pave,,reg,Lvl,AllPub,...,,,,0,10,20080,WD,normal,1180000,True
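
In the output above, float columns such as LotFrontage are left unchanged, because `times10` only multiplies values of type `int`. Note also that `DataFrame.applymap` is deprecated since pandas 2.1 in favour of `DataFrame.map`. If the goal is to scale every numeric column, floats included, one alternative that avoids element-wise calls is to select the numeric columns and multiply them directly. The sketch below is one way to do that, on a small made-up DataFrame that borrows a few column names from the dataset:

```python
import pandas as pd

# Tiny stand-in for the house prices data, for illustration only.
df = pd.DataFrame({
    'LotFrontage': [65.0, 80.0],   # float column
    'LotArea': [8450, 9600],       # int column
    'Street': ['Pave', 'Pave'],    # string column
    'rich_dude': [True, True],     # boolean column
})

# 'number' covers integer and float dtypes but not booleans or strings.
numeric_cols = df.select_dtypes(include='number').columns

# Work on a copy so the original DataFrame stays as it was.
scaled = df.copy()
scaled[numeric_cols] = scaled[numeric_cols] * 10

print(scaled)
```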
