# Machine Learning Seminar 2017 - MIM UW
### Workshop Anaconda + Jupyter + SciKit-learn + Pandas 
Michał Woś


### ANACONDA
* Download: środowisko Anaconda z Pythonem 2.7: https://www.anaconda.com/download/#linux
* `conda create --name ml-sem --file ml-sem-spec.txt` Stworzenie odseparowanego środowiska potrzebnego do tego notebooka (zainstaluje konieczne pakiety)
* `. activate ml-sem` aktywowanie odseparowanego środowiska
* `jupyter notebook` odpalenie notebooka - można także wyklikać w `anaconda-navigator`
* `conda install package_name` do doinstalowywania pakietów lub wygodne GUI do zarządzania całością `anaconda-navigator`

<img src="includes/anaconda1.png" />
<img src="includes/anaconda2.jpg" />
<img src="includes/anaconda3.jpg" />
<img src="includes/anaconda4.jpg" />
<img src="includes/anaconda5.jpg" />

* Czym jest? Pakiety, środowisko. Czym nie jest? Menadżerem konfiguracji, procesów.
* Izolacja oprogramowania, różnych jego wersji, które w innym przypadku mogłoby kolidować.
* Uproszczenie życia dla nieinformatycznych Data scientist.
* Wersja Enterprise zintegrowana z rozwiązaniami Cloudowymi i dockerem.
* GUI

### JUPYTER
##### Podstawowe polecenia
* `ctrl-enter` - wykonianie kodu komórki
* `ctrl-shift-enter` - wykonanie kodu komórki i przejście do następnej komórki/dodanie kolejnej komórki
* `tab` - uzupełnienie kodu
* `shift-tab` - pokaż sygnaturę funkcji

### PANDAS

In [3]:
import pandas as pd

### DataFrame
<img src="includes/dataframe.jpg" />

#### Tutoriale
* TL;DR https://pandas.pydata.org/pandas-docs/stable/10min.html
* Full: https://pandas.pydata.org/pandas-docs/stable/tutorials.html

In [139]:
pd.DataFrame({
    'letter': list('abcd'),
    'number': [1, 2, 3, 4]
})

Unnamed: 0,letter,number
0,a,1
1,b,2
2,c,3
3,d,4


In [140]:
df = pd.DataFrame(
    [
        ['a',   0.1,    1],
        ['b', '0.2', None],
        ['c',  None,    3],
        ['a',   1.5,    4],
    ],
    columns = ['litery', 'miks', 'liczby']
)
df

Unnamed: 0,litery,miks,liczby
0,a,0.1,1.0
1,b,0.2,
2,c,,3.0
3,a,1.5,4.0


In [229]:
print(type(df['litery']))
print(type(df[['litery', 'miks']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [142]:
df.at[1, 'miks']

'0.2'

In [230]:
df.loc[1:3,'miks':]

Unnamed: 0,miks,liczby
1,0.2,
2,,3.0
3,1.5,4.0


In [144]:
print(df.dtypes)

litery     object
miks       object
liczby    float64
dtype: object


In [145]:
import numpy as np


print('Tylko numeryki:')
print(df.select_dtypes(include=[np.number]))
print('\nTylko nie numeryki:')
print(df.select_dtypes(exclude=[np.number]))

Tylko numeryki:
   liczby
0     1.0
1     NaN
2     3.0
3     4.0

Tylko nie numeryki:
  litery  miks
0      a   0.1
1      b   0.2
2      c  None
3      a   1.5


In [146]:
# Wyrażenia logiczne
df['litery'] == 'a'

0     True
1    False
2    False
3     True
Name: litery, dtype: bool

In [147]:
# Używanie logicznego wyrażenia do wybierania podzbioru
selector = df['litery'] == 'a'
df[selector]

Unnamed: 0,litery,miks,liczby
0,a,0.1,1.0
3,a,1.5,4.0


In [148]:
df[(df['litery'] == 'a') & (df['liczby'] == 4)]

Unnamed: 0,litery,miks,liczby
3,a,1.5,4.0


In [234]:
# Operacje na stringach
~df['litery'].str.contains('a')  # nie zawiera

0    True
1    True
2    True
3    True
Name: litery, dtype: bool

In [150]:
# Metoda apply - domyślnie nie modyfikuje oryginalnej ramki
counter = 0
def add_values(column):
    # column jest typu pandas.Series
    global counter  # NIE CZYŃ TAK!
    counter += 1
    return column + column


print('Wynik apply:')
print(df[df['litery'] == 'a'].apply(add_values))
print()
print('Oryginalny df:')
print(df)

Wynik apply:
  litery  miks  liczby
0     aa   0.2     2.0
3     aa   3.0     8.0
()
Oryginalny df:
  litery  miks  liczby
0      a   0.1     1.0
1      b   0.2     NaN
2      c  None     3.0
3      a   1.5     4.0


In [151]:
print(counter)

4


In [152]:
df.dropna(how='any')

Unnamed: 0,litery,miks,liczby
0,a,0.1,1.0
3,a,1.5,4.0


In [153]:
# Metoda apply - operacje na wierszach
def add_values(row):
    row['litery'] = int(row['liczby'] * row['miks']) * row['litery']
    return row


df[df['litery'] == 'a'].apply(add_values, axis=1)  # domyślny parametr axis=0 -> operacje na kolumnach

Unnamed: 0,litery,miks,liczby
0,,0.1,1.0
3,aaaaaa,1.5,4.0


In [154]:
# Modyfikacja wybranych wierszy w danej kolumnie: loc
df.loc[df['litery'] == 'a', 'litery'] = 'ww'

# nie tak, jesli chcesz zmieniac wartość!
# df['litery'][df['litery'] == 'a'] = 'b'

df

Unnamed: 0,litery,miks,liczby
0,ww,0.1,1.0
1,b,0.2,
2,c,,3.0
3,ww,1.5,4.0


#### Agregacje, grupowania, sortownaia, łączenia...

* Podpowiadanie komend `<tab>` - do pierwszej kropki przypisz na zmienną, wykonaj `ctrl-enter` potem `.` i `<tab>`

In [155]:
g = df.groupby('litery')
g.sum()

Unnamed: 0_level_0,liczby
litery,Unnamed: 1_level_1
b,
c,3.0
ww,5.0


In [156]:
df.sort_values(by='liczby', ascending=False)

Unnamed: 0,litery,miks,liczby
3,ww,1.5,4.0
2,c,,3.0
0,ww,0.1,1.0
1,b,0.2,


In [157]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,2,4


### PANDAS PROFILING

Dane - Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In [5]:
import pandas_profiling
import numpy as np

train = pd.read_csv('train.csv', index_col='Id')
pfr = pandas_profiling.ProfileReport(train)
pfr.to_file("raw_train_raport.html")
pfr

0,1
Number of variables,81
Number of observations,1460
Total Missing (%),0.0%
Total size in memory,924.0 KiB
Average record size in memory,648.0 B

0,1
Numeric,38
Categorical,43
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,753
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1162.6
Minimum,334
Maximum,4692
Zeros (%),0.0%

0,1
Minimum,334.0
5-th percentile,672.95
Q1,882.0
Median,1087.0
Q3,1391.2
95-th percentile,1831.2
Maximum,4692.0
Range,4358.0
Interquartile range,509.25

0,1
Standard deviation,386.59
Coef of variation,0.33251
Kurtosis,5.7458
Mean,1162.6
MAD,300.58
Skewness,1.3768
Sum,1697435
Variance,149450
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
864,25,0.0%,
1040,16,0.0%,
912,14,0.0%,
848,12,0.0%,
894,12,0.0%,
672,11,0.0%,
816,9,0.0%,
630,9,0.0%,
936,7,0.0%,
960,7,0.0%,

Value,Count,Frequency (%),Unnamed: 3
334,1,0.0%,
372,1,0.0%,
438,1,0.0%,
480,1,0.0%,
483,7,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2633,1,0.0%,
2898,1,0.0%,
3138,1,0.0%,
3228,1,0.0%,
4692,1,0.0%,

0,1
Distinct count,417
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,346.99
Minimum,0
Maximum,2065
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,728
95-th percentile,1141
Maximum,2065
Range,2065
Interquartile range,728

0,1
Standard deviation,436.53
Coef of variation,1.258
Kurtosis,-0.55346
Mean,346.99
MAD,396.48
Skewness,0.81303
Sum,506609
Variance,190560
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,829,0.0%,
728,10,0.0%,
504,9,0.0%,
672,8,0.0%,
546,8,0.0%,
720,7,0.0%,
600,7,0.0%,
896,6,0.0%,
780,5,0.0%,
862,5,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,829,0.0%,
110,1,0.0%,
167,1,0.0%,
192,1,0.0%,
208,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1611,1,0.0%,
1796,1,0.0%,
1818,1,0.0%,
1872,1,0.0%,
2065,1,0.0%,

0,1
Distinct count,20
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.4096
Minimum,0
Maximum,508
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,508
Range,508
Interquartile range,0

0,1
Standard deviation,29.317
Coef of variation,8.5985
Kurtosis,123.66
Mean,3.4096
MAD,6.7071
Skewness,10.304
Sum,4978
Variance,859.51
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1436,0.0%,
168,3,0.0%,
216,2,0.0%,
144,2,0.0%,
180,2,0.0%,
245,1,0.0%,
238,1,0.0%,
290,1,0.0%,
196,1,0.0%,
182,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1436,0.0%,
23,1,0.0%,
96,1,0.0%,
130,1,0.0%,
140,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
290,1,0.0%,
304,1,0.0%,
320,1,0.0%,
407,1,0.0%,
508,1,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),100.0%
Missing (n),1369

0,1
Grvl,50
Pave,41
(Missing),1369

Value,Count,Frequency (%),Unnamed: 3
Grvl,50,0.0%,
Pave,41,0.0%,
(Missing),1369,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.8664
Minimum,0
Maximum,8
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,2
Q1,2
Median,3
Q3,3
95-th percentile,4
Maximum,8
Range,8
Interquartile range,1

0,1
Standard deviation,0.81578
Coef of variation,0.2846
Kurtosis,2.2309
Mean,2.8664
MAD,0.57631
Skewness,0.21179
Sum,4185
Variance,0.66549
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
3,804,0.0%,
2,358,0.0%,
4,213,0.0%,
1,50,0.0%,
5,21,0.0%,
6,7,0.0%,
0,6,0.0%,
8,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,6,0.0%,
1,50,0.0%,
2,358,0.0%,
3,804,0.0%,
4,213,0.0%,

Value,Count,Frequency (%),Unnamed: 3
3,804,0.0%,
4,213,0.0%,
5,21,0.0%,
6,7,0.0%,
8,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
1Fam,1220
TwnhsE,114
Duplex,52
Other values (2),74

Value,Count,Frequency (%),Unnamed: 3
1Fam,1220,0.0%,
TwnhsE,114,0.0%,
Duplex,52,0.0%,
Twnhs,43,0.0%,
2fmCon,31,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),100.0%
Missing (n),37

0,1
TA,1311
Gd,65
Fa,45
(Missing),37

Value,Count,Frequency (%),Unnamed: 3
TA,1311,0.0%,
Gd,65,0.0%,
Fa,45,0.0%,
Po,2,0.0%,
(Missing),37,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),100.0%
Missing (n),38

0,1
No,953
Av,221
Gd,134

Value,Count,Frequency (%),Unnamed: 3
No,953,0.0%,
Av,221,0.0%,
Gd,134,0.0%,
Mn,114,0.0%,
(Missing),38,0.0%,

0,1
Distinct count,637
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,443.64
Minimum,0
Maximum,5644
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,383.5
Q3,712.25
95-th percentile,1274.0
Maximum,5644.0
Range,5644.0
Interquartile range,712.25

0,1
Standard deviation,456.1
Coef of variation,1.0281
Kurtosis,11.118
Mean,443.64
MAD,367.37
Skewness,1.6855
Sum,647714
Variance,208030
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,467,0.0%,
24,12,0.0%,
16,9,0.0%,
20,5,0.0%,
686,5,0.0%,
616,5,0.0%,
936,5,0.0%,
662,5,0.0%,
428,4,0.0%,
655,4,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,467,0.0%,
2,1,0.0%,
16,9,0.0%,
20,5,0.0%,
24,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1904,1,0.0%,
2096,1,0.0%,
2188,1,0.0%,
2260,1,0.0%,
5644,1,0.0%,

0,1
Distinct count,144
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,46.549
Minimum,0
Maximum,1474
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.0
Q3,0.0
95-th percentile,396.2
Maximum,1474.0
Range,1474.0
Interquartile range,0.0

0,1
Standard deviation,161.32
Coef of variation,3.4656
Kurtosis,20.113
Mean,46.549
MAD,82.535
Skewness,4.2553
Sum,67962
Variance,26024
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1293,0.0%,
180,5,0.0%,
374,3,0.0%,
551,2,0.0%,
93,2,0.0%,
468,2,0.0%,
147,2,0.0%,
480,2,0.0%,
539,2,0.0%,
712,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1293,0.0%,
28,1,0.0%,
32,1,0.0%,
35,1,0.0%,
40,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1080,1,0.0%,
1085,1,0.0%,
1120,1,0.0%,
1127,1,0.0%,
1474,1,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),100.0%
Missing (n),37

0,1
Unf,430
GLQ,418
ALQ,220
Other values (3),355

Value,Count,Frequency (%),Unnamed: 3
Unf,430,0.0%,
GLQ,418,0.0%,
ALQ,220,0.0%,
BLQ,148,0.0%,
Rec,133,0.0%,
LwQ,74,0.0%,
(Missing),37,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),100.0%
Missing (n),38

0,1
Unf,1256
Rec,54
LwQ,46
Other values (3),66
(Missing),38

Value,Count,Frequency (%),Unnamed: 3
Unf,1256,0.0%,
Rec,54,0.0%,
LwQ,46,0.0%,
BLQ,33,0.0%,
ALQ,19,0.0%,
GLQ,14,0.0%,
(Missing),38,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.42534
Minimum,0
Maximum,3
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,1
Maximum,3
Range,3
Interquartile range,1

0,1
Standard deviation,0.51891
Coef of variation,1.22
Kurtosis,-0.8391
Mean,0.42534
MAD,0.49876
Skewness,0.59607
Sum,621
Variance,0.26927
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,856,0.0%,
1,588,0.0%,
2,15,0.0%,
3,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,856,0.0%,
1,588,0.0%,
2,15,0.0%,
3,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,856,0.0%,
1,588,0.0%,
2,15,0.0%,
3,1,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.057534
Minimum,0
Maximum,2
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,2
Range,2
Interquartile range,0

0,1
Standard deviation,0.23875
Coef of variation,4.1497
Kurtosis,16.397
Mean,0.057534
MAD,0.10861
Skewness,4.1034
Sum,84
Variance,0.057003
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1378,0.0%,
1,80,0.0%,
2,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1378,0.0%,
1,80,0.0%,
2,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1378,0.0%,
1,80,0.0%,
2,2,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),100.0%
Missing (n),37

0,1
TA,649
Gd,618
Ex,121
(Missing),37

Value,Count,Frequency (%),Unnamed: 3
TA,649,0.0%,
Gd,618,0.0%,
Ex,121,0.0%,
Fa,35,0.0%,
(Missing),37,0.0%,

0,1
Distinct count,780
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,567.24
Minimum,0
Maximum,2336
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,223.0
Median,477.5
Q3,808.0
95-th percentile,1468.0
Maximum,2336.0
Range,2336.0
Interquartile range,585.0

0,1
Standard deviation,441.87
Coef of variation,0.77898
Kurtosis,0.47499
Mean,567.24
MAD,353.28
Skewness,0.92027
Sum,828171
Variance,195250
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,118,0.0%,
728,9,0.0%,
384,8,0.0%,
572,7,0.0%,
600,7,0.0%,
300,7,0.0%,
440,6,0.0%,
625,6,0.0%,
280,6,0.0%,
672,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,118,0.0%,
14,1,0.0%,
15,1,0.0%,
23,2,0.0%,
26,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2042,1,0.0%,
2046,1,0.0%,
2121,1,0.0%,
2153,1,0.0%,
2336,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Y,1365
N,95

Value,Count,Frequency (%),Unnamed: 3
Y,1365,0.0%,
N,95,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Norm,1260
Feedr,81
Artery,48
Other values (6),71

Value,Count,Frequency (%),Unnamed: 3
Norm,1260,0.0%,
Feedr,81,0.0%,
Artery,48,0.0%,
RRAn,26,0.0%,
PosN,19,0.0%,
RRAe,11,0.0%,
PosA,8,0.0%,
RRNn,5,0.0%,
RRNe,2,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Norm,1445
Feedr,6
Artery,2
Other values (5),7

Value,Count,Frequency (%),Unnamed: 3
Norm,1445,0.0%,
Feedr,6,0.0%,
Artery,2,0.0%,
RRNn,2,0.0%,
PosN,2,0.0%,
RRAn,1,0.0%,
RRAe,1,0.0%,
PosA,1,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),100.0%
Missing (n),1

0,1
SBrkr,1334
FuseA,94
FuseF,27
Other values (2),4

Value,Count,Frequency (%),Unnamed: 3
SBrkr,1334,0.0%,
FuseA,94,0.0%,
FuseF,27,0.0%,
FuseP,3,0.0%,
Mix,1,0.0%,
(Missing),1,0.0%,

0,1
Distinct count,120
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,21.954
Minimum,0
Maximum,552
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.0
Q3,0.0
95-th percentile,180.15
Maximum,552.0
Range,552.0
Interquartile range,0.0

0,1
Standard deviation,61.119
Coef of variation,2.784
Kurtosis,10.431
Mean,21.954
MAD,37.66
Skewness,3.0899
Sum,32053
Variance,3735.6
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1252,0.0%,
112,15,0.0%,
96,6,0.0%,
120,5,0.0%,
144,5,0.0%,
192,5,0.0%,
216,5,0.0%,
252,4,0.0%,
116,4,0.0%,
156,4,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1252,0.0%,
19,1,0.0%,
20,1,0.0%,
24,1,0.0%,
30,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
301,1,0.0%,
318,1,0.0%,
330,1,0.0%,
386,1,0.0%,
552,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
TA,1282
Gd,146
Fa,28
Other values (2),4

Value,Count,Frequency (%),Unnamed: 3
TA,1282,0.0%,
Gd,146,0.0%,
Fa,28,0.0%,
Ex,3,0.0%,
Po,1,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
TA,906
Gd,488
Ex,52

Value,Count,Frequency (%),Unnamed: 3
TA,906,0.0%,
Gd,488,0.0%,
Ex,52,0.0%,
Fa,14,0.0%,

0,1
Distinct count,15
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
VinylSd,515
HdBoard,222
MetalSd,220
Other values (12),503

Value,Count,Frequency (%),Unnamed: 3
VinylSd,515,0.0%,
HdBoard,222,0.0%,
MetalSd,220,0.0%,
Wd Sdng,206,0.0%,
Plywood,108,0.0%,
CemntBd,61,0.0%,
BrkFace,50,0.0%,
WdShing,26,0.0%,
Stucco,25,0.0%,
AsbShng,20,0.0%,

0,1
Distinct count,16
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
VinylSd,504
MetalSd,214
HdBoard,207
Other values (13),535

Value,Count,Frequency (%),Unnamed: 3
VinylSd,504,0.0%,
MetalSd,214,0.0%,
HdBoard,207,0.0%,
Wd Sdng,197,0.0%,
Plywood,142,0.0%,
CmentBd,60,0.0%,
Wd Shng,38,0.0%,
Stucco,26,0.0%,
BrkFace,25,0.0%,
AsbShng,20,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),100.0%
Missing (n),1179

0,1
MnPrv,157
GdPrv,59
GdWo,54
(Missing),1179

Value,Count,Frequency (%),Unnamed: 3
MnPrv,157,0.0%,
GdPrv,59,0.0%,
GdWo,54,0.0%,
MnWw,11,0.0%,
(Missing),1179,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),100.0%
Missing (n),690

0,1
Gd,380
TA,313
Fa,33
Other values (2),44
(Missing),690

Value,Count,Frequency (%),Unnamed: 3
Gd,380,0.0%,
TA,313,0.0%,
Fa,33,0.0%,
Ex,24,0.0%,
Po,20,0.0%,
(Missing),690,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.61301
Minimum,0
Maximum,3
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,1
95-th percentile,2
Maximum,3
Range,3
Interquartile range,1

0,1
Standard deviation,0.64467
Coef of variation,1.0516
Kurtosis,-0.21724
Mean,0.61301
MAD,0.57942
Skewness,0.64957
Sum,895
Variance,0.41559
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,690,0.0%,
1,650,0.0%,
2,115,0.0%,
3,5,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,690,0.0%,
1,650,0.0%,
2,115,0.0%,
3,5,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,690,0.0%,
1,650,0.0%,
2,115,0.0%,
3,5,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
PConc,647
CBlock,634
BrkTil,146
Other values (3),33

Value,Count,Frequency (%),Unnamed: 3
PConc,647,0.0%,
CBlock,634,0.0%,
BrkTil,146,0.0%,
Slab,24,0.0%,
Stone,6,0.0%,
Wood,3,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.5651
Minimum,0
Maximum,3
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,2
Q3,2
95-th percentile,2
Maximum,3
Range,3
Interquartile range,1

0,1
Standard deviation,0.55092
Coef of variation,0.35201
Kurtosis,-0.85704
Mean,1.5651
MAD,0.52244
Skewness,0.036562
Sum,2285
Variance,0.30351
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
2,768,0.0%,
1,650,0.0%,
3,33,0.0%,
0,9,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,9,0.0%,
1,650,0.0%,
2,768,0.0%,
3,33,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,9,0.0%,
1,650,0.0%,
2,768,0.0%,
3,33,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Typ,1360
Min2,34
Min1,31
Other values (4),35

Value,Count,Frequency (%),Unnamed: 3
Typ,1360,0.0%,
Min2,34,0.0%,
Min1,31,0.0%,
Mod,15,0.0%,
Maj1,14,0.0%,
Maj2,5,0.0%,
Sev,1,0.0%,

0,1
Distinct count,441
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,472.98
Minimum,0
Maximum,1418
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,334.5
Median,480.0
Q3,576.0
95-th percentile,850.1
Maximum,1418.0
Range,1418.0
Interquartile range,241.5

0,1
Standard deviation,213.8
Coef of variation,0.45204
Kurtosis,0.91707
Mean,472.98
MAD,160.02
Skewness,0.17998
Sum,690551
Variance,45713
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,81,0.0%,
440,49,0.0%,
576,47,0.0%,
240,38,0.0%,
484,34,0.0%,
528,33,0.0%,
288,27,0.0%,
400,25,0.0%,
480,24,0.0%,
264,24,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,81,0.0%,
160,2,0.0%,
164,1,0.0%,
180,9,0.0%,
186,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1220,1,0.0%,
1248,1,0.0%,
1356,1,0.0%,
1390,1,0.0%,
1418,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.7671
Minimum,0
Maximum,4
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,1
Median,2
Q3,2
95-th percentile,3
Maximum,4
Range,4
Interquartile range,1

0,1
Standard deviation,0.74732
Coef of variation,0.4229
Kurtosis,0.221
Mean,1.7671
MAD,0.58384
Skewness,-0.34255
Sum,2580
Variance,0.55848
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
2,824,0.0%,
1,369,0.0%,
3,181,0.0%,
0,81,0.0%,
4,5,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,81,0.0%,
1,369,0.0%,
2,824,0.0%,
3,181,0.0%,
4,5,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,81,0.0%,
1,369,0.0%,
2,824,0.0%,
3,181,0.0%,
4,5,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),100.0%
Missing (n),81

0,1
TA,1326
Fa,35
Gd,9
Other values (2),9
(Missing),81

Value,Count,Frequency (%),Unnamed: 3
TA,1326,0.0%,
Fa,35,0.0%,
Gd,9,0.0%,
Po,7,0.0%,
Ex,2,0.0%,
(Missing),81,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),100.0%
Missing (n),81

0,1
Unf,605
RFn,422
Fin,352
(Missing),81

Value,Count,Frequency (%),Unnamed: 3
Unf,605,0.0%,
RFn,422,0.0%,
Fin,352,0.0%,
(Missing),81,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),100.0%
Missing (n),81

0,1
TA,1311
Fa,48
Gd,14
Other values (2),6
(Missing),81

Value,Count,Frequency (%),Unnamed: 3
TA,1311,0.0%,
Fa,48,0.0%,
Gd,14,0.0%,
Ex,3,0.0%,
Po,3,0.0%,
(Missing),81,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),100.0%
Missing (n),81

0,1
Attchd,870
Detchd,387
BuiltIn,88
Other values (3),34
(Missing),81

Value,Count,Frequency (%),Unnamed: 3
Attchd,870,0.0%,
Detchd,387,0.0%,
BuiltIn,88,0.0%,
Basment,19,0.0%,
CarPort,9,0.0%,
2Types,6,0.0%,
(Missing),81,0.0%,

0,1
Distinct count,98
Unique (%),0.0%
Missing (%),100.0%
Missing (n),81
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1978.5
Minimum,1900
Maximum,2010
Zeros (%),0.0%

0,1
Minimum,1900
5-th percentile,1930
Q1,1961
Median,1980
Q3,2002
95-th percentile,2007
Maximum,2010
Range,110
Interquartile range,41

0,1
Standard deviation,24.69
Coef of variation,0.012479
Kurtosis,-0.41834
Mean,1978.5
MAD,20.913
Skewness,-0.64941
Sum,2728400
Variance,609.58
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
2005.0,65,0.0%,
2006.0,59,0.0%,
2004.0,53,0.0%,
2003.0,50,0.0%,
2007.0,49,0.0%,
1977.0,35,0.0%,
1998.0,31,0.0%,
1999.0,30,0.0%,
1976.0,29,0.0%,
2008.0,29,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1900.0,1,0.0%,
1906.0,1,0.0%,
1908.0,1,0.0%,
1910.0,3,0.0%,
1914.0,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2006.0,59,0.0%,
2007.0,49,0.0%,
2008.0,29,0.0%,
2009.0,21,0.0%,
2010.0,3,0.0%,

0,1
Distinct count,861
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1515.5
Minimum,334
Maximum,5642
Zeros (%),0.0%

0,1
Minimum,334.0
5-th percentile,848.0
Q1,1129.5
Median,1464.0
Q3,1776.8
95-th percentile,2466.1
Maximum,5642.0
Range,5308.0
Interquartile range,647.25

0,1
Standard deviation,525.48
Coef of variation,0.34675
Kurtosis,4.8951
Mean,1515.5
MAD,397.32
Skewness,1.3666
Sum,2212577
Variance,276130
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
864,22,0.0%,
1040,14,0.0%,
894,11,0.0%,
848,10,0.0%,
1456,10,0.0%,
912,9,0.0%,
1200,9,0.0%,
816,8,0.0%,
1092,8,0.0%,
1344,7,0.0%,

Value,Count,Frequency (%),Unnamed: 3
334,1,0.0%,
438,1,0.0%,
480,1,0.0%,
520,1,0.0%,
605,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
3627,1,0.0%,
4316,1,0.0%,
4476,1,0.0%,
4676,1,0.0%,
5642,1,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.38288
Minimum,0
Maximum,2
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,1
Maximum,2
Range,2
Interquartile range,1

0,1
Standard deviation,0.50289
Coef of variation,1.3134
Kurtosis,-1.0769
Mean,0.38288
MAD,0.47886
Skewness,0.6759
Sum,559
Variance,0.25289
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,913,0.0%,
1,535,0.0%,
2,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,913,0.0%,
1,535,0.0%,
2,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,913,0.0%,
1,535,0.0%,
2,12,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
GasA,1428
GasW,18
Grav,7
Other values (3),7

Value,Count,Frequency (%),Unnamed: 3
GasA,1428,0.0%,
GasW,18,0.0%,
Grav,7,0.0%,
Wall,4,0.0%,
OthW,2,0.0%,
Floor,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Ex,741
TA,428
Gd,241
Other values (2),50

Value,Count,Frequency (%),Unnamed: 3
Ex,741,0.0%,
TA,428,0.0%,
Gd,241,0.0%,
Fa,49,0.0%,
Po,1,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
1Story,726
2Story,445
1.5Fin,154
Other values (5),135

Value,Count,Frequency (%),Unnamed: 3
1Story,726,0.0%,
2Story,445,0.0%,
1.5Fin,154,0.0%,
SLvl,65,0.0%,
SFoyer,37,0.0%,
1.5Unf,14,0.0%,
2.5Unf,11,0.0%,
2.5Fin,8,0.0%,

0,1
Distinct count,1460
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,730.5
Minimum,1
Maximum,1460
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,73.95
Q1,365.75
Median,730.5
Q3,1095.2
95-th percentile,1387.0
Maximum,1460.0
Range,1459.0
Interquartile range,729.5

0,1
Standard deviation,421.61
Coef of variation,0.57715
Kurtosis,-1.2
Mean,730.5
MAD,365
Skewness,0
Sum,1066530
Variance,177760
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
1460,1,0.0%,
479,1,0.0%,
481,1,0.0%,
482,1,0.0%,
483,1,0.0%,
484,1,0.0%,
485,1,0.0%,
486,1,0.0%,
487,1,0.0%,
488,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,1,0.0%,
2,1,0.0%,
3,1,0.0%,
4,1,0.0%,
5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1456,1,0.0%,
1457,1,0.0%,
1458,1,0.0%,
1459,1,0.0%,
1460,1,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.0466
Minimum,0
Maximum,3
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,1
Q3,1
95-th percentile,1
Maximum,3
Range,3
Interquartile range,0

0,1
Standard deviation,0.22034
Coef of variation,0.21053
Kurtosis,21.532
Mean,1.0466
MAD,0.090246
Skewness,4.4884
Sum,1528
Variance,0.048549
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
1,1392,0.0%,
2,65,0.0%,
3,2,0.0%,
0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
1,1392,0.0%,
2,65,0.0%,
3,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
1,1392,0.0%,
2,65,0.0%,
3,2,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
TA,735
Gd,586
Ex,100

Value,Count,Frequency (%),Unnamed: 3
TA,735,0.0%,
Gd,586,0.0%,
Ex,100,0.0%,
Fa,39,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Lvl,1311
Bnk,63
HLS,50

Value,Count,Frequency (%),Unnamed: 3
Lvl,1311,0.0%,
Bnk,63,0.0%,
HLS,50,0.0%,
Low,36,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Gtl,1382
Mod,65
Sev,13

Value,Count,Frequency (%),Unnamed: 3
Gtl,1382,0.0%,
Mod,65,0.0%,
Sev,13,0.0%,

0,1
Distinct count,1073
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10517
Minimum,1300
Maximum,215245
Zeros (%),0.0%

0,1
Minimum,1300.0
5-th percentile,3311.7
Q1,7553.5
Median,9478.5
Q3,11602.0
95-th percentile,17401.0
Maximum,215245.0
Range,213945.0
Interquartile range,4048.0

0,1
Standard deviation,9981.3
Coef of variation,0.94908
Kurtosis,203.24
Mean,10517
MAD,3758.8
Skewness,12.208
Sum,15354569
Variance,99626000
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
7200,25,0.0%,
9600,24,0.0%,
6000,17,0.0%,
10800,14,0.0%,
9000,14,0.0%,
8400,14,0.0%,
1680,10,0.0%,
7500,9,0.0%,
8125,8,0.0%,
9100,8,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1300,1,0.0%,
1477,1,0.0%,
1491,1,0.0%,
1526,1,0.0%,
1533,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
70761,1,0.0%,
115149,1,0.0%,
159000,1,0.0%,
164660,1,0.0%,
215245,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Inside,1052
Corner,263
CulDSac,94
Other values (2),51

Value,Count,Frequency (%),Unnamed: 3
Inside,1052,0.0%,
Corner,263,0.0%,
CulDSac,94,0.0%,
FR2,47,0.0%,
FR3,4,0.0%,

0,1
Distinct count,111
Unique (%),0.0%
Missing (%),100.0%
Missing (n),259
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,70.05
Minimum,21
Maximum,313
Zeros (%),0.0%

0,1
Minimum,21
5-th percentile,34
Q1,59
Median,69
Q3,80
95-th percentile,107
Maximum,313
Range,292
Interquartile range,21

0,1
Standard deviation,24.285
Coef of variation,0.34668
Kurtosis,17.453
Mean,70.05
MAD,16.762
Skewness,2.1636
Sum,84130
Variance,589.75
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
60.0,143,0.0%,
70.0,70,0.0%,
80.0,69,0.0%,
50.0,57,0.0%,
75.0,53,0.0%,
65.0,44,0.0%,
85.0,40,0.0%,
78.0,25,0.0%,
21.0,23,0.0%,
90.0,23,0.0%,

Value,Count,Frequency (%),Unnamed: 3
21.0,23,0.0%,
24.0,19,0.0%,
30.0,6,0.0%,
32.0,5,0.0%,
33.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
160.0,1,0.0%,
168.0,1,0.0%,
174.0,2,0.0%,
182.0,1,0.0%,
313.0,2,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Reg,925
IR1,484
IR2,41

Value,Count,Frequency (%),Unnamed: 3
Reg,925,0.0%,
IR1,484,0.0%,
IR2,41,0.0%,
IR3,10,0.0%,

0,1
Distinct count,24
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.8445
Minimum,0
Maximum,572
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,572
Range,572
Interquartile range,0

0,1
Standard deviation,48.623
Coef of variation,8.3194
Kurtosis,83.235
Mean,5.8445
MAD,11.481
Skewness,9.0113
Sum,8533
Variance,2364.2
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1434,0.0%,
80,3,0.0%,
360,2,0.0%,
528,1,0.0%,
53,1,0.0%,
120,1,0.0%,
144,1,0.0%,
156,1,0.0%,
205,1,0.0%,
232,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1434,0.0%,
53,1,0.0%,
80,3,0.0%,
120,1,0.0%,
144,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
513,1,0.0%,
514,1,0.0%,
515,1,0.0%,
528,1,0.0%,
572,1,0.0%,

0,1
Distinct count,15
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,56.897
Minimum,20
Maximum,190
Zeros (%),0.0%

0,1
Minimum,20
5-th percentile,20
Q1,20
Median,50
Q3,70
95-th percentile,160
Maximum,190
Range,170
Interquartile range,50

0,1
Standard deviation,42.301
Coef of variation,0.74346
Kurtosis,1.5802
Mean,56.897
MAD,31.283
Skewness,1.4077
Sum,83070
Variance,1789.3
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
20,536,0.0%,
60,299,0.0%,
50,144,0.0%,
120,87,0.0%,
30,69,0.0%,
160,63,0.0%,
70,60,0.0%,
80,58,0.0%,
90,52,0.0%,
190,30,0.0%,

Value,Count,Frequency (%),Unnamed: 3
20,536,0.0%,
30,69,0.0%,
40,4,0.0%,
45,12,0.0%,
50,144,0.0%,

Value,Count,Frequency (%),Unnamed: 3
90,52,0.0%,
120,87,0.0%,
160,63,0.0%,
180,10,0.0%,
190,30,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
RL,1151
RM,218
FV,65
Other values (2),26

Value,Count,Frequency (%),Unnamed: 3
RL,1151,0.0%,
RM,218,0.0%,
FV,65,0.0%,
RH,16,0.0%,
C (all),10,0.0%,

0,1
Distinct count,328
Unique (%),0.0%
Missing (%),100.0%
Missing (n),8
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,103.69
Minimum,0
Maximum,1600
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,166
95-th percentile,456
Maximum,1600
Range,1600
Interquartile range,166

0,1
Standard deviation,181.07
Coef of variation,1.7463
Kurtosis,10.082
Mean,103.69
MAD,129.78
Skewness,2.6691
Sum,150550
Variance,32785
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,861,0.0%,
72.0,8,0.0%,
180.0,8,0.0%,
108.0,8,0.0%,
120.0,7,0.0%,
16.0,7,0.0%,
106.0,6,0.0%,
80.0,6,0.0%,
340.0,6,0.0%,
200.0,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,861,0.0%,
1.0,2,0.0%,
11.0,1,0.0%,
14.0,1,0.0%,
16.0,7,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1115.0,1,0.0%,
1129.0,1,0.0%,
1170.0,1,0.0%,
1378.0,1,0.0%,
1600.0,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),100.0%
Missing (n),8

0,1
,864
BrkFace,445
Stone,128

Value,Count,Frequency (%),Unnamed: 3
,864,0.0%,
BrkFace,445,0.0%,
Stone,128,0.0%,
BrkCmn,15,0.0%,
(Missing),8,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),100.0%
Missing (n),1406

0,1
Shed,49
Othr,2
Gar2,2
(Missing),1406

Value,Count,Frequency (%),Unnamed: 3
Shed,49,0.0%,
Othr,2,0.0%,
Gar2,2,0.0%,
TenC,1,0.0%,
(Missing),1406,0.0%,

0,1
Distinct count,21
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,43.489
Minimum,0
Maximum,15500
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,15500
Range,15500
Interquartile range,0

0,1
Standard deviation,496.12
Coef of variation,11.408
Kurtosis,701
Mean,43.489
MAD,83.88
Skewness,24.477
Sum,63494
Variance,246140
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1408,0.0%,
400,11,0.0%,
500,8,0.0%,
700,5,0.0%,
450,4,0.0%,
2000,4,0.0%,
600,4,0.0%,
1200,2,0.0%,
480,2,0.0%,
1150,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1408,0.0%,
54,1,0.0%,
350,1,0.0%,
400,11,0.0%,
450,4,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2000,4,0.0%,
2500,1,0.0%,
3500,1,0.0%,
8300,1,0.0%,
15500,1,0.0%,

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.3219
Minimum,1
Maximum,12
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,5
Median,6
Q3,8
95-th percentile,11
Maximum,12
Range,11
Interquartile range,3

0,1
Standard deviation,2.7036
Coef of variation,0.42766
Kurtosis,-0.40411
Mean,6.3219
MAD,2.1425
Skewness,0.21205
Sum,9230
Variance,7.3096
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
6,253,0.0%,
7,234,0.0%,
5,204,0.0%,
4,141,0.0%,
8,122,0.0%,
3,106,0.0%,
10,89,0.0%,
11,79,0.0%,
9,63,0.0%,
12,59,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,58,0.0%,
2,52,0.0%,
3,106,0.0%,
4,141,0.0%,
5,204,0.0%,

Value,Count,Frequency (%),Unnamed: 3
8,122,0.0%,
9,63,0.0%,
10,89,0.0%,
11,79,0.0%,
12,59,0.0%,

0,1
Distinct count,25
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
NAmes,225
CollgCr,150
OldTown,113
Other values (22),972

Value,Count,Frequency (%),Unnamed: 3
NAmes,225,0.0%,
CollgCr,150,0.0%,
OldTown,113,0.0%,
Edwards,100,0.0%,
Somerst,86,0.0%,
Gilbert,79,0.0%,
NridgHt,77,0.0%,
Sawyer,74,0.0%,
NWAmes,73,0.0%,
SawyerW,59,0.0%,

0,1
Distinct count,202
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,46.66
Minimum,0
Maximum,547
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,25.0
Q3,68.0
95-th percentile,175.05
Maximum,547.0
Range,547.0
Interquartile range,68.0

0,1
Standard deviation,66.256
Coef of variation,1.42
Kurtosis,8.4903
Mean,46.66
MAD,47.678
Skewness,2.3643
Sum,68124
Variance,4389.9
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,656,0.0%,
36,29,0.0%,
48,22,0.0%,
20,21,0.0%,
40,19,0.0%,
45,19,0.0%,
30,16,0.0%,
24,16,0.0%,
60,15,0.0%,
39,14,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,656,0.0%,
4,1,0.0%,
8,1,0.0%,
10,1,0.0%,
11,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
406,1,0.0%,
418,1,0.0%,
502,1,0.0%,
523,1,0.0%,
547,1,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.5753
Minimum,1
Maximum,9
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,4
Q1,5
Median,5
Q3,6
95-th percentile,8
Maximum,9
Range,8
Interquartile range,1

0,1
Standard deviation,1.1128
Coef of variation,0.19959
Kurtosis,1.1064
Mean,5.5753
MAD,0.88902
Skewness,0.69307
Sum,8140
Variance,1.2383
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
5,821,0.0%,
6,252,0.0%,
7,205,0.0%,
8,72,0.0%,
4,57,0.0%,
3,25,0.0%,
9,22,0.0%,
2,5,0.0%,
1,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,1,0.0%,
2,5,0.0%,
3,25,0.0%,
4,57,0.0%,
5,821,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5,821,0.0%,
6,252,0.0%,
7,205,0.0%,
8,72,0.0%,
9,22,0.0%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.0993
Minimum,1
Maximum,10
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,4
Q1,5
Median,6
Q3,7
95-th percentile,8
Maximum,10
Range,9
Interquartile range,2

0,1
Standard deviation,1.383
Coef of variation,0.22675
Kurtosis,0.096293
Mean,6.0993
MAD,1.098
Skewness,0.21694
Sum,8905
Variance,1.9127
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
5,397,0.0%,
6,374,0.0%,
7,319,0.0%,
8,168,0.0%,
4,116,0.0%,
9,43,0.0%,
3,20,0.0%,
10,18,0.0%,
2,3,0.0%,
1,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,2,0.0%,
2,3,0.0%,
3,20,0.0%,
4,116,0.0%,
5,397,0.0%,

Value,Count,Frequency (%),Unnamed: 3
6,374,0.0%,
7,319,0.0%,
8,168,0.0%,
9,43,0.0%,
10,18,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Y,1340
N,90
P,30

Value,Count,Frequency (%),Unnamed: 3
Y,1340,0.0%,
N,90,0.0%,
P,30,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.7589
Minimum,0
Maximum,738
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,738
Range,738
Interquartile range,0

0,1
Standard deviation,40.177
Coef of variation,14.563
Kurtosis,223.27
Mean,2.7589
MAD,5.4914
Skewness,14.828
Sum,4028
Variance,1614.2
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1453,0.0%,
738,1,0.0%,
648,1,0.0%,
576,1,0.0%,
555,1,0.0%,
519,1,0.0%,
512,1,0.0%,
480,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1453,0.0%,
480,1,0.0%,
512,1,0.0%,
519,1,0.0%,
555,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
519,1,0.0%,
555,1,0.0%,
576,1,0.0%,
648,1,0.0%,
738,1,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),100.0%
Missing (n),1453

0,1
Gd,3
Ex,2
Fa,2
(Missing),1453

Value,Count,Frequency (%),Unnamed: 3
Gd,3,0.0%,
Ex,2,0.0%,
Fa,2,0.0%,
(Missing),1453,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
CompShg,1434
Tar&Grv,11
WdShngl,6
Other values (5),9

Value,Count,Frequency (%),Unnamed: 3
CompShg,1434,0.0%,
Tar&Grv,11,0.0%,
WdShngl,6,0.0%,
WdShake,5,0.0%,
Membran,1,0.0%,
Metal,1,0.0%,
ClyTile,1,0.0%,
Roll,1,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Gable,1141
Hip,286
Flat,13
Other values (3),20

Value,Count,Frequency (%),Unnamed: 3
Gable,1141,0.0%,
Hip,286,0.0%,
Flat,13,0.0%,
Gambrel,11,0.0%,
Mansard,7,0.0%,
Shed,2,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Normal,1198
Partial,125
Abnorml,101
Other values (3),36

Value,Count,Frequency (%),Unnamed: 3
Normal,1198,0.0%,
Partial,125,0.0%,
Abnorml,101,0.0%,
Family,20,0.0%,
Alloca,12,0.0%,
AdjLand,4,0.0%,

0,1
Distinct count,663
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,180920
Minimum,34900
Maximum,755000
Zeros (%),0.0%

0,1
Minimum,34900
5-th percentile,88000
Q1,129980
Median,163000
Q3,214000
95-th percentile,326100
Maximum,755000
Range,720100
Interquartile range,84025

0,1
Standard deviation,79443
Coef of variation,0.4391
Kurtosis,6.5363
Mean,180920
MAD,57435
Skewness,1.8829
Sum,264144946
Variance,6311100000
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
140000,20,0.0%,
135000,17,0.0%,
145000,14,0.0%,
155000,14,0.0%,
190000,13,0.0%,
110000,13,0.0%,
160000,12,0.0%,
115000,12,0.0%,
139000,11,0.0%,
130000,11,0.0%,

Value,Count,Frequency (%),Unnamed: 3
34900,1,0.0%,
35311,1,0.0%,
37900,1,0.0%,
39300,1,0.0%,
40000,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
582933,1,0.0%,
611657,1,0.0%,
625000,1,0.0%,
745000,1,0.0%,
755000,1,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
WD,1267
New,122
COD,43
Other values (6),28

Value,Count,Frequency (%),Unnamed: 3
WD,1267,0.0%,
New,122,0.0%,
COD,43,0.0%,
ConLD,9,0.0%,
ConLw,5,0.0%,
ConLI,5,0.0%,
CWD,4,0.0%,
Oth,3,0.0%,
Con,2,0.0%,

0,1
Distinct count,76
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,15.061
Minimum,0
Maximum,480
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,160
Maximum,480
Range,480
Interquartile range,0

0,1
Standard deviation,55.757
Coef of variation,3.7021
Kurtosis,18.439
Mean,15.061
MAD,27.729
Skewness,4.1222
Sum,21989
Variance,3108.9
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1344,0.0%,
192,6,0.0%,
224,5,0.0%,
120,5,0.0%,
189,4,0.0%,
180,4,0.0%,
160,3,0.0%,
168,3,0.0%,
144,3,0.0%,
126,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1344,0.0%,
40,1,0.0%,
53,1,0.0%,
60,1,0.0%,
63,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
385,1,0.0%,
396,1,0.0%,
410,1,0.0%,
440,1,0.0%,
480,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Pave,1454
Grvl,6

Value,Count,Frequency (%),Unnamed: 3
Pave,1454,0.0%,
Grvl,6,0.0%,

0,1
Distinct count,12
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.5178
Minimum,2
Maximum,14
Zeros (%),0.0%

0,1
Minimum,2
5-th percentile,4
Q1,5
Median,6
Q3,7
95-th percentile,10
Maximum,14
Range,12
Interquartile range,2

0,1
Standard deviation,1.6254
Coef of variation,0.24938
Kurtosis,0.88076
Mean,6.5178
MAD,1.2796
Skewness,0.67634
Sum,9516
Variance,2.6419
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
6,402,0.0%,
7,329,0.0%,
5,275,0.0%,
8,187,0.0%,
4,97,0.0%,
9,75,0.0%,
10,47,0.0%,
11,18,0.0%,
3,17,0.0%,
12,11,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2,1,0.0%,
3,17,0.0%,
4,97,0.0%,
5,275,0.0%,
6,402,0.0%,

Value,Count,Frequency (%),Unnamed: 3
9,75,0.0%,
10,47,0.0%,
11,18,0.0%,
12,11,0.0%,
14,1,0.0%,

0,1
Distinct count,721
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1057.4
Minimum,0
Maximum,6110
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,519.3
Q1,795.75
Median,991.5
Q3,1298.2
95-th percentile,1753.0
Maximum,6110.0
Range,6110.0
Interquartile range,502.5

0,1
Standard deviation,438.71
Coef of variation,0.41488
Kurtosis,13.25
Mean,1057.4
MAD,321.28
Skewness,1.5243
Sum,1543847
Variance,192460
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,37,0.0%,
864,35,0.0%,
672,17,0.0%,
912,15,0.0%,
1040,14,0.0%,
816,13,0.0%,
728,12,0.0%,
768,12,0.0%,
848,11,0.0%,
780,11,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,37,0.0%,
105,1,0.0%,
190,1,0.0%,
264,3,0.0%,
270,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
3094,1,0.0%,
3138,1,0.0%,
3200,1,0.0%,
3206,1,0.0%,
6110,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
AllPub,1459
NoSeWa,1

Value,Count,Frequency (%),Unnamed: 3
AllPub,1459,0.0%,
NoSeWa,1,0.0%,

0,1
Distinct count,274
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,94.245
Minimum,0
Maximum,857
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,168
95-th percentile,335
Maximum,857
Range,857
Interquartile range,168

0,1
Standard deviation,125.34
Coef of variation,1.3299
Kurtosis,2.993
Mean,94.245
MAD,102
Skewness,1.5414
Sum,137597
Variance,15710
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,761,0.0%,
192,38,0.0%,
100,36,0.0%,
144,33,0.0%,
120,31,0.0%,
168,28,0.0%,
140,15,0.0%,
224,14,0.0%,
240,10,0.0%,
208,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,761,0.0%,
12,2,0.0%,
24,2,0.0%,
26,2,0.0%,
28,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
668,1,0.0%,
670,1,0.0%,
728,1,0.0%,
736,1,0.0%,
857,1,0.0%,

0,1
Distinct count,112
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1971.3
Minimum,1872
Maximum,2010
Zeros (%),0.0%

0,1
Minimum,1872
5-th percentile,1916
Q1,1954
Median,1973
Q3,2000
95-th percentile,2007
Maximum,2010
Range,138
Interquartile range,46

0,1
Standard deviation,30.203
Coef of variation,0.015322
Kurtosis,-0.43955
Mean,1971.3
MAD,25.067
Skewness,-0.61346
Sum,2878051
Variance,912.22
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
2006,67,0.0%,
2005,64,0.0%,
2004,54,0.0%,
2007,49,0.0%,
2003,45,0.0%,
1976,33,0.0%,
1977,32,0.0%,
1920,30,0.0%,
1959,26,0.0%,
1999,25,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1872,1,0.0%,
1875,1,0.0%,
1880,4,0.0%,
1882,1,0.0%,
1885,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2006,67,0.0%,
2007,49,0.0%,
2008,23,0.0%,
2009,18,0.0%,
2010,1,0.0%,

0,1
Distinct count,61
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1984.9
Minimum,1950
Maximum,2010
Zeros (%),0.0%

0,1
Minimum,1950
5-th percentile,1950
Q1,1967
Median,1994
Q3,2004
95-th percentile,2007
Maximum,2010
Range,60
Interquartile range,37

0,1
Standard deviation,20.645
Coef of variation,0.010401
Kurtosis,-1.2722
Mean,1984.9
MAD,18.623
Skewness,-0.50356
Sum,2897904
Variance,426.23
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
1950,178,0.0%,
2006,97,0.0%,
2007,76,0.0%,
2005,73,0.0%,
2004,62,0.0%,
2000,55,0.0%,
2003,51,0.0%,
2002,48,0.0%,
2008,40,0.0%,
1996,36,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1950,178,0.0%,
1951,4,0.0%,
1952,5,0.0%,
1953,10,0.0%,
1954,14,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2006,97,0.0%,
2007,76,0.0%,
2008,40,0.0%,
2009,23,0.0%,
2010,6,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2007.8
Minimum,2006
Maximum,2010
Zeros (%),0.0%

0,1
Minimum,2006
5-th percentile,2006
Q1,2007
Median,2008
Q3,2009
95-th percentile,2010
Maximum,2010
Range,4
Interquartile range,2

0,1
Standard deviation,1.3281
Coef of variation,0.00066146
Kurtosis,-1.1906
Mean,2007.8
MAD,1.1487
Skewness,0.096269
Sum,2931411
Variance,1.7638
Memory size,11.5 KiB

Value,Count,Frequency (%),Unnamed: 3
2009,338,0.0%,
2007,329,0.0%,
2006,314,0.0%,
2008,304,0.0%,
2010,175,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2006,314,0.0%,
2007,329,0.0%,
2008,304,0.0%,
2009,338,0.0%,
2010,175,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2006,314,0.0%,
2007,329,0.0%,
2008,304,0.0%,
2009,338,0.0%,
2010,175,0.0%,

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


### SKLEARN

##### Transformacje
* `fit(X, y=None, **fit_params)`
* `transform(X, y=None, **fit_params)`
* `fit_transform(X, y=None, **fit_params)`

In [158]:
train = pd.read_csv('train.csv', index_col='Id')
y = train.pop('SalePrice')

test = pd.read_csv('test.csv', index_col='Id')
print(train.shape)
train.head(10)

(1460, 79)


Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal
6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal
7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,,,,0,8,2007,WD,Normal
8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal
9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,,0,4,2008,WD,Abnorml
10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,,0,1,2008,WD,Normal


In [159]:
train.columns

Index([u'MSSubClass', u'MSZoning', u'LotFrontage', u'LotArea', u'Street',
       u'Alley', u'LotShape', u'LandContour', u'Utilities', u'LotConfig',
       u'LandSlope', u'Neighborhood', u'Condition1', u'Condition2',
       u'BldgType', u'HouseStyle', u'OverallQual', u'OverallCond',
       u'YearBuilt', u'YearRemodAdd', u'RoofStyle', u'RoofMatl',
       u'Exterior1st', u'Exterior2nd', u'MasVnrType', u'MasVnrArea',
       u'ExterQual', u'ExterCond', u'Foundation', u'BsmtQual', u'BsmtCond',
       u'BsmtExposure', u'BsmtFinType1', u'BsmtFinSF1', u'BsmtFinType2',
       u'BsmtFinSF2', u'BsmtUnfSF', u'TotalBsmtSF', u'Heating', u'HeatingQC',
       u'CentralAir', u'Electrical', u'1stFlrSF', u'2ndFlrSF', u'LowQualFinSF',
       u'GrLivArea', u'BsmtFullBath', u'BsmtHalfBath', u'FullBath',
       u'HalfBath', u'BedroomAbvGr', u'KitchenAbvGr', u'KitchenQual',
       u'TotRmsAbvGrd', u'Functional', u'Fireplaces', u'FireplaceQu',
       u'GarageType', u'GarageYrBlt', u'GarageFinish', u'GarageCars'

In [160]:
pd.set_option('display.max_columns', None)
train.select_dtypes(include=[np.number]).loc[1:5,:]

Unnamed: 0_level_0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1
1,60,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008
2,20,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007
3,60,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008
4,70,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006
5,60,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008


In [161]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
# uwaga, to nie jest enkoder ze scikit, ale implementuje jego API, wiec mozna uzywac!
from category_encoders.one_hot import OneHotEncoder
from sklearn.base import TransformerMixin

In [162]:
#train['MSSubClass'] = train['MSSubClass'].astype(str)

# to samo dla test
# ..albo:

def fix_type(df):
    df['MSSubClass'] = df['MSSubClass'].astype(str)
    return df

# transformer "bez pamięci" - zrobi dokladnie to samo dla train set i test set
fix_type_transformer = FunctionTransformer(fix_type, validate=False)

#fixed_train = fix_type_transformer.fit_transform(train)
#fixed_test = fix_type_transformer.transform(test)

In [163]:
# transformer "z pamięcia"
# konieczne gdy feature w train i test mają inne charakterystyki
class FilterBinaryOutliers(TransformerMixin):
    def __init__(self):
        self.min_count = 2
        self._excludes = []
        
    def fit(self, X, y=None, **kwargs):
        binary_selector = X.apply(lambda col: col.nunique()) <= 2
        nrows = X.shape[0]
        ol = X.loc[:,binary_selector].apply(lambda col: nrows - col.groupby(col).size().max() <= self.min_count )
        self._excludes = ol.index[ol]
        return self
    
    def transform(self, X, y=None, **kwargs):
        return X.drop(self._excludes, axis=1)
    
    # niepotrzebne:
    def get_params(self, deep=True):
        return { "excluded_columns" : self._excludes }
    
tmp_train = pd.DataFrame({
    'col1': [0, 0, 0, 1, 1, 1],
    'col2': [0, 1, 1, 1, 1, 1]
})
tmp_test = pd.DataFrame({
    'col1': [0, 0, 0, 0, 1, 1],
    'col2': [0, 0, 0, 1, 1, 1]
})

f = FilterBinaryOutliers()
f.fit(tmp_train)
f.transform(tmp_test)

Unnamed: 0,col1
0,0
1,0
2,0
3,0
4,1
5,1


In [187]:
numeric_pipeline = Pipeline([
    ('SelectNumeric', FunctionTransformer(lambda df: df.select_dtypes(include=[np.number]), validate=False)),
    ('FillNA', FunctionTransformer(lambda df: df.fillna(-1), validate=False)),
    ('Normalize', MinMaxScaler())
])

categoric_pipeline = Pipeline([
    ('SelectCategoric', FunctionTransformer(lambda df: df.select_dtypes(exclude=[np.number]), validate=False)),
    # z "pamięcia" zapamiętującą w fazie fit
    ('OHE', OneHotEncoder(handle_unknown='ignore', return_df=True)),
    ('FilterOutliers', FilterBinaryOutliers())
])

feature_union = FeatureUnion([
    ('numeric', numeric_pipeline),
    ('categoric', categoric_pipeline),
])

transform_pipeline = Pipeline([
    ("FixType", fix_type_transformer),
    ("TransformData", feature_union)
])

#### Trening i walidacja

In [165]:
from sklearn.model_selection import train_test_split
# dużo, dużo więcej możliwości walidacji modelu: 
# http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

X_train, X_valid, y_train, y_valid = train_test_split(train, y, test_size=0.25, shuffle=True)

In [166]:
from sklearn.linear_model import LogisticRegression

logreg_pipeline = Pipeline([
       ("Transform", transform_pipeline),
       ('Estimator', LogisticRegression()),
])

logreg_pipeline.fit(X_train, y_train)
prediction = logreg_pipeline.predict(X_valid)

# nie przejmować się SettingWithCopyWarning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [167]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Root-mean-square deviation
RMSD = sqrt(mean_squared_error(prediction, y_valid))
print("RMSD %f" % RMSD)

# na kagglu chcą RMSD z log
def kaggle_RMSD(prediction, true):
    prediction_log = map(sqrt, prediction)
    y_valid_log = map(sqrt, true)
    return sqrt(mean_squared_error(prediction_log, y_valid_log))

print("RMSD of logs %f" % kaggle_RMSD(prediction, y_valid))

RMSD 54891.311134
RMSD of logs 55.601081


### Tuning regresora - GridSearch + CV

In [226]:
## Tuning regresora
from sklearn.model_selection import GridSearchCV

params = {
    #'penalty' : ['l1', 'l2'],
    #'dual': [True, False],
    'tol': [1e-6, 1e-4, 1e-2],
    #'C': [0.8, 1.0, 1.2, 1.4],
    'fit_intercept': [True, False],  
}
# all listed here http://scikit-learn.org/stable/modules/model_evaluation.html
scoring = {'MSE': 'neg_mean_squared_error'}

gscv = GridSearchCV(LogisticRegression(), 
                    param_grid=params,
                    scoring='neg_mean_squared_error',
                    refit='neg_mean_squared_error',
                    n_jobs=5,
                    cv=4)

In [219]:
X_train_trans = transform_pipeline.fit_transform(X_train)
X_valid_trans = transform_pipeline.transform(X_valid)

gscv.fit(X_train_trans, y_train)
prediction = gscv.predict(X_valid_trans)
best_logreg = gscv.best_estimator_
best_logreg

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


LogisticRegression(C=1.2, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [227]:
# Alternatrywnie moglibyśmy zaprząc grid search BEZPOŚREDNIO do naszego pipeline'u jako estymator
# uwaga: tak nie da się wprost wyciągnąć znalezionych hiperparametrów przez GridSearch
logreg_tuned_pipeline = Pipeline([
       ("Transform", transform_pipeline),
       ('Estimator', gscv),
])

logreg_tuned_pipeline.fit(X_train, y_train)
prediction = logreg_tuned_pipeline.predict(X_valid)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [228]:
kaggle_RMSD(prediction, y_valid)

55.71208382618633

In [31]:
# final
logreg_pipeline.fit(train, y)
prediction = logreg_pipeline.predict(test)

In [74]:
def kaggle_save(test, prediction):
    result = pd.DataFrame({
            'Id': test.index, 
            'SalePrice': prediction
    })
    result.to_csv("final.csv", index=False)
    return result

### Neural Nets - Keras

In [33]:
import keras
from keras.layers import Input, Dense, Dropout, Concatenate
from keras import Model
from keras.callbacks import EarlyStopping
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from keras import metrics

Using TensorFlow backend.


In [100]:
def create_model(colnum=258):
	# create model
    inputs = Input(shape=(colnum,))
    dense1 = Dense(256, activation='relu')(inputs)
    dense1 = Dropout(0.5)(dense1)
    dense2 = Dense(256, activation='relu')(dense1)
    dense2 = Dropout(0.5)(dense2)
    output = Dense(1, activation='relu', name='output')(dense2)
    model = Model(inputs=[inputs], outputs=[output])
    # exploding gradient problem when sgd as optimizer
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=[metrics.mean_squared_error])
    return model


def create_model_3(colnum=258):
	# create model
    inputs = Input(shape=(colnum,))
    dense1 = Dense(256, activation='relu')(inputs)
    dense1 = Dropout(0.5)(dense1)
    dense2 = Dense(256, activation='relu')(dense1)
    dense2 = Dropout(0.5)(dense2)
    dense3 = Dense(128, activation='relu')(dense2)
    dense4 = Dense(32, activation='relu')(dense3)
    output = Dense(1, activation='relu', name='output')(dense4)
    model = Model(inputs=[inputs], outputs=[output])
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=[metrics.mean_squared_error])
    return model

`Input` reprezentuje warstwę wejściową. Musimy podać ilość neuronów, drugi wymiar możemy zostawić pusty - keras sam się go domyśli na podstawie rozmiaru batcha wejściowego. 

'Dense' reprezentuje warstwę gęstą, tzn taką, w której każdy neuron jest połączony z każdym neuronem poprzedniej warstwy. W postaci macierzowej możemy zapisać to jako $Y = f(AX + b)$, gdzie $f$ jest funkcją aktywacji, $X$ jest macierzą obserwacji w poprzedniej warstwie, $A$ to macierz wag a $b$ to wektor biasów.

'Dropout' to operacja, w wyniku której odsetek neuronów jest tymczasowo zerowany podczas traningu. Dzięki temu model ma mniejszą tendencję do przeuczania się.

Ostatnia warstwa zawiera tylko jeden neuron - jest to odpowiedź zwracana przez model. Za pomocą funkcji sigmoidalnej skalujemy czystą odpowiedź na przedział [0, 1].

Model jest następnie "kompilowany" - zadawana jest funkcja straty, definiujemy algorytm optymalizacji oraz metryki zwracane podczas uczenia.

In [35]:
# mógłbym zawołać tak:
X_train_trans = transform_pipeline.fit_transform(X_train)
X_valid_trans = transform_pipeline.transform(X_valid)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [98]:
model = create_model_3(colnum=X_train_trans.shape[1])
model.fit([X_train_trans], [y_train], epochs=100, batch_size=32, verbose=1, validation_data=[[X_valid_trans], [y_valid]])

Train on 1095 samples, validate on 365 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100


Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x7f8c583af990>

In [99]:
nn_predict = model.predict(X_valid_trans).flatten()
kaggle_RMSD(nn_predict, y_valid)

25.97869574321095

In [95]:
# prepare final submission for kaggle:
train_trans = transform_pipeline.fit_transform(train)
test_trans = transform_pipeline.transform(test)
model = create_model_3(colnum=train_trans.shape[1])
model.fit([train_trans], [y], epochs=400, batch_size=32, verbose=0)
nn_predict = model.predict(test_trans).flatten()

In [96]:
kaggle_save(test, nn_predict)

Unnamed: 0,Id,SalePrice
0,1461,121975.085938
1,1462,149423.015625
2,1463,175245.843750
3,1464,193364.343750
4,1465,183697.609375
5,1466,165606.500000
6,1467,172794.718750
7,1468,159737.921875
8,1469,172448.125000
9,1470,122740.132812


In [None]:
# ale mozemy takze wpleść model w nasz pipeline

from keras.wrappers.scikit_learn import KerasRegressor

nn_model = KerasRegressor(build_fn=create_model_3, epochs=10, batch_size=32, verbose=1)

full_pipeline = Pipeline([
       ("Transform", transform_pipeline),
       ('Estimator', nn_model),
])

full_pipeline.fit(x_train, y)
prediction = full_pipeline.predict(X_valid)


TODO:
* dodać sekcje wizualizacja: https://pandas.pydata.org/pandas-docs/stable/visualization.html
* zamienić import pandas na import z scikit-learn

Ciekawe linki:
* https://shiring.github.io/r_vs_python/2017/01/22/R_vs_Py_post
