# 2 Identifying variables

In [1]:
import pandas as pd
df = pd.read_csv('train.csv')

### Showing the types and values

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

You can also check how many missing values a column contains by chaining **isnull()** and **sum()**.

In [3]:
df.Fence.isnull().sum()

1179
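
To get the same count for every column at once, `isnull().sum()` can be applied to the whole DataFrame. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not rows from train.csv) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'Street': ['Pave', 'Pave', 'Pave', 'Grvl'],
})

# Number of missing values per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)

# Show only the columns that actually have missing values
print(missing[missing > 0])
```

On the real data, replacing `toy` with `df` should give one count per column.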

### Describe the range of each feature

In [4]:
df.LotFrontage.max()

313.0

In [5]:
df.LotFrontage.min()

21.0

In [6]:
df.LotFrontage.describe()

count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64
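
The minimum and maximum of several numeric columns can also be collected in one table. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration); `select_dtypes` and `agg` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'LotArea': [8450, 9600, 11250, 9550],
    'Street': ['Pave', 'Pave', 'Pave', 'Grvl'],
})

# Keep the numeric columns only, then compute min and max per column.
# Missing values are skipped by default.
ranges = toy.select_dtypes(include='number').agg(['min', 'max'])
print(ranges)
```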

### Unique values of a column
For categorical features, you can inspect which values occur in a column using `unique()`.

In [7]:
df.MSZoning.unique()

array(['RL', 'RM', 'C (all)', 'FV', 'RH'], dtype=object)

In [8]:
df.PavedDrive.unique()

array(['Y', 'N', 'P'], dtype=object)

We can also show the frequency of every value for a column.

In [9]:
df.PavedDrive.value_counts()

Y    1340
N      90
P      30
Name: PavedDrive, dtype: int64
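
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. The sketch below rebuilds a Series with the same counts as the output above (the name `paved` is an assumption for illustration) so that it runs without train.csv.

```python
import pandas as pd

# Toy Series with the same label counts as PavedDrive in the output above
paved = pd.Series(['Y'] * 1340 + ['N'] * 90 + ['P'] * 30, name='PavedDrive')

# normalize=True returns the fraction of rows per label instead of the count
shares = paved.value_counts(normalize=True)
print(shares.round(3))
```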

### Convert values
Often, it is easier to process your data as numbers. For instance, the feature PavedDrive has a categorical label, but if we want to use it in a regression algorithm we need to convert it to a number. In this case we convert it as follows: N=0, P=1, Y=2 (assuming P means something like Partial).

In [10]:
paved_drive = {'N':0, 'P':1, 'Y':2} # setup a dictionary to do the conversion

def convert_paved_drive(p):
    return paved_drive[p] # return the value of p in paved_drive dict

convert_paved_drive('P') # returns 1 because the dictionary maps the key 'P' to the value 1

1

We can use the **map()** method of a pandas Series to apply a function to every element of a column. Note that we could alternatively pass the dictionary paved_drive directly to map(), since the Series map() method also accepts a dictionary.

In [11]:
df['PavedDriveN'] = df.PavedDrive.map(convert_paved_drive)

In [13]:
df.PavedDriveN[df.PavedDriveN < 2][:10]

21     0
30     0
39     0
41     1
48     0
61     0
88     0
97     1
106    0
108    0
Name: PavedDriveN, dtype: int64
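
Passing the dictionary directly and passing a lookup function behave differently for labels that are not in the dictionary. The sketch below shows this on a toy Series; the label `'X'` is made up for illustration and does not occur in the dataset.

```python
import pandas as pd

paved_drive = {'N': 0, 'P': 1, 'Y': 2}

# Toy column; 'X' is a made-up label that is not in the dictionary
s = pd.Series(['Y', 'N', 'P', 'X'])

# Passing the dictionary directly: unknown labels become NaN
mapped = s.map(paved_drive)
print(mapped)

# Passing a function that indexes the dictionary: unknown labels raise KeyError
try:
    s.map(lambda p: paved_drive[p])
except KeyError as err:
    print('KeyError for label', err)
```

Because NaN forces a float column, the dictionary variant returns floats whenever a label is missing from the mapping. Which behavior is preferable depends on whether unknown labels should stop the analysis or be handled later as missing values.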

#### Assignment: convert KitchenQual to a number. According to the dataset description, the labels mean:

|Label|description|
|:---|---|
|Ex|Excellent|
|Gd|Good|
|TA|Typical/Average|
|Fa|Fair|
|Po|Poor|

In [17]:
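# set up a dictionary; note that 0 is the best quality and 4 the worst
kitchen_to_number = {'Ex': 0, 'Gd': 1, 'TA': 2, 'Fa': 3, 'Po': 4}

def convert_kitchenQual(k):
    return kitchen_to_number[k]  # look up the number for label k

# this overwrites the original labels, so running the cell a second time
# raises a KeyError because the column then holds numbers instead of labels
df['KitchenQual'] = df.KitchenQual.map(convert_kitchenQual)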

print(df.KitchenQual.head())


0    1
1    2
2    1
3    1
4    1
Name: KitchenQual, dtype: int64


### Add features

We can add new features to the DataFrame by assigning values to a new column name. In this example we compute the sum of the 1st floor space and the 2nd floor space.

In [15]:
# note: index these columns with df['...'] rather than df.<name>, because a Python attribute name cannot start with a digit.
df['2FlrSF'] = df['1stFlrSF'] + df['2ndFlrSF']
df['2FlrSF'][:10]

0    1710
1    1262
2    1786
3    1717
4    2198
5    1362
6    1694
7    2090
8    1774
9    1077
Name: 2FlrSF, dtype: int64

In [16]:
# alternatively, give a list of columns.
df[['1stFlrSF', '2ndFlrSF', '2FlrSF']]

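| | 1stFlrSF | 2ndFlrSF | 2FlrSF |
|--:|--:|--:|--:|
| 0 | 856 | 854 | 1710 |
| 1 | 1262 | 0 | 1262 |
| 2 | 920 | 866 | 1786 |
| 3 | 961 | 756 | 1717 |
| 4 | 1145 | 1053 | 2198 |
| 5 | 796 | 566 | 1362 |
| 6 | 1694 | 0 | 1694 |
| 7 | 1107 | 983 | 2090 |
| 8 | 1022 | 752 | 1774 |
| 9 | 1077 | 0 | 1077 |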


### Code book

Write a code book: a document in which you list some specifics of the collection, such as the number of samples, and for every variable a description, datatype, numeric/categorical, number of missing values, value range, and an example value. After the analysis, you can add the distribution of each variable and how the data was cleaned (missing values and outliers) and transformed. Include every operation performed on the data so that these steps can be replicated exactly.

| variable | description | datatype | numeric/categorical | #missing | range | example value |
|--|--|--|--|--:|:-:|--|
| 1stFlrSF | First Floor square feet | int | Numeric | 0 | 334-4602 | 334 |
| 2FlrSF | Sum First Floor + Second Floor Square Feet | int | Numeric | 0 | 334-5642 | 5642 |
| PavedDriveN | State of driveway | int | Numeric | 0 | 0-2 (gravel/dirt, partially paved, paved) | 2 |
| PavedDrive | State of driveway | text | Categorical | 0 | N, P, Y (gravel/dirt, partially paved, paved) | N |
| BsmtQual | Height of the basement | text | Categorical | 37 | Ex, Gd, TA, Fa, Po, NA (Excellent 100+", Good 90-99", Typical 80-89", Fair 70-79", Poor <70", No Basement) | Ex |
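
The mechanical columns of a code book (datatype, numeric/categorical, number of missing values, range, example value) can be drafted from the DataFrame itself. The following is a rough sketch, not a complete solution: `code_book` is a hypothetical helper, and the `toy` frame and its values are made up so that the example runs without train.csv.

```python
import numpy as np
import pandas as pd

def code_book(frame):
    """Build one draft code-book row per column (hypothetical helper)."""
    rows = []
    for col in frame.columns:
        values = frame[col]
        present = values.dropna()  # ignore missing values for range/example
        numeric = pd.api.types.is_numeric_dtype(values)
        if numeric:
            value_range = f'{present.min()}-{present.max()}'
        else:
            # for categorical columns, list the distinct labels
            value_range = ', '.join(sorted(present.astype(str).unique()))
        rows.append({
            'variable': col,
            'datatype': str(values.dtype),
            'numeric/categorical': 'Numeric' if numeric else 'Categorical',
            '#missing': int(values.isnull().sum()),
            'range': value_range,
            'example value': present.iloc[0] if len(present) else None,
        })
    return pd.DataFrame(rows)

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    '1stFlrSF': [856, 1262, 920],
    'PavedDrive': ['Y', 'N', 'Y'],
    'LotFrontage': [65.0, np.nan, 68.0],
})
book = code_book(toy)
print(book)
```

The description column and the meaning of each label still have to be written by hand from the dataset documentation.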

In [21]:
df.KitchenQual.isnull().sum()

0

#### Assignment: Add KitchenQual to the Code Book

| variable | description | datatype | numeric/categorical | #missing | range | example value |
|--|--|--|--|--:|:-:|--|
| 1stFlrSF | First Floor square feet | int | Numeric | 0 | 334-4602 | 334 |
| 2FlrSF | Sum First Floor + Second Floor Square Feet | int | Numeric | 0 | 334-5642 | 5642 |
| PavedDriveN | State of driveway | int | Numeric | 0 | 0-2 (gravel/dirt, partially paved, paved)| 2 |
| PavedDrive | State of driveway | text | Categorical | 0 | N, P, Y (gravel/dirt, partially paved, paved) | N |
| BsmtQual | Height of the basement | text | Categorical | 37 | Ex, Gd, TA, Fa, Po, NA (Excellent 100+", Good 90-99", Typical 80-89", Fair 70-79", Poor <70", No Basement) | Ex |
| KitchenQual | Quality of the kitchen | int | Numeric | 0 | 0-4 (Excellent, Good, Typical/Average, Fair, Poor) | 1 |