## First we need to import the libraries we are going to use
Here we need two full libraries:
**numpy** (linear algebra and mathematics) and **pandas** (data manipulation and I/O).

We also need some parts of **sklearn**, in particular the RandomForestRegressor and the preprocessing module.

sklearn is a large library, so it is good practice to import only the parts you need.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor # import the random forest model
from sklearn import preprocessing # used for label encoding (LabelEncoder)

## Next we import the data

In [None]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
macro_df = pd.read_csv('../input/macro.csv')

## We assign our prediction variable and set our training set
We also store the ids for our predictions and trim the train and test sets by removing the id and timestamp columns.

In [None]:
id_test = test_df.id
y_train = train_df["price_doc"]
x_train = train_df.drop(["id", "timestamp", "price_doc"], axis=1)
x_test = test_df.drop(["id", "timestamp"], axis=1)

### *The code below can be used to cross-validate the training set. The first piece calculates the cross-validation scores and the second plots them visually based on the number of trees.*

In [None]:
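#from sklearn.model_selection import cross_val_score # cross-validation helper (sklearn.cross_validation was removed in scikit-learn 0.20)
#import matplotlib.pyplot as plt # needed for the plot at the end of this cell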
#scores = list()
#scores_std = list()

#print('Start learning...')
#n_trees = [10, 50, 75]
#for n_tree in n_trees:
#        print(n_tree)
#        recognizer = RandomForestRegressor(n_tree)
#        score = cross_val_score(recognizer, x_train, y_train)
#        scores.append(np.mean(score))
#        scores_std.append(np.std(score))

#sc_array = np.array(scores)
#std_array = np.array(scores_std)
#print('Score: ', sc_array)
#print('Std  : ', std_array)


#plt.plot(n_trees, scores)
#plt.plot(n_trees, sc_array + std_array, 'b--')
#plt.plot(n_trees, sc_array - std_array, 'b--')
#plt.ylabel('CV score')
#plt.xlabel('# of trees')
#plt.savefig('cv_trees.png')
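
As a rough, runnable sketch of the same cross-validation idea, the snippet below scores forests of different sizes on synthetic data. The data from `make_regression`, the tree counts, and `cv=3` are all assumptions chosen for speed, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train (an assumption, not the competition data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

n_trees = [5, 10, 20]
mean_scores = []
std_scores = []
for n_tree in n_trees:
    model = RandomForestRegressor(n_estimators=n_tree, random_state=0)
    # cv=3 returns three scores per setting; regressors are scored with R^2 by default
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=3)
    mean_scores.append(np.mean(fold_scores))
    std_scores.append(np.std(fold_scores))

for n_tree, mean, std in zip(n_trees, mean_scores, std_scores):
    print(f"{n_tree:3d} trees: mean R^2 = {mean:.3f} (std {std:.3f})")
```

Swapping `X_demo` and `y_demo` for the encoded and imputed training data would give scores for the real problem, at the cost of a much longer run time.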

## Numerical encoding of features
Our model needs numeric inputs, so we need to assign a numeric value to each category in the text (object) columns of our training and test sets.
Sklearn's preprocessing module has a tool called LabelEncoder() which can do just that for us.

We fit each encoder on the train and test values combined, so that a given category gets the same number in both sets. Fitting a separate encoder on each set could map the same category to different numbers.

In [None]:
for c in x_train.columns:
    if x_train[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        # fit on train and test together so the codes are consistent across both sets
        lbl.fit(list(x_train[c].values) + list(x_test[c].values))
        x_train[c] = lbl.transform(list(x_train[c].values))
        x_test[c] = lbl.transform(list(x_test[c].values))
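
To illustrate why it matters which values an encoder is fitted on, here is a minimal sketch on toy columns. The category strings and variable names are hypothetical and only stand in for a text column in the data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns (hypothetical values) where train and test contain different categories
train_col = pd.Series(["Investment", "OwnerOccupier", "Investment"])
test_col = pd.Series(["OwnerOccupier", "OwnerOccupier"])

# Separate encoders: the same string can map to different integers
separate_train = LabelEncoder().fit(train_col)
separate_test = LabelEncoder().fit(test_col)
code_in_train = int(separate_train.transform(["OwnerOccupier"])[0])
code_in_test = int(separate_test.transform(["OwnerOccupier"])[0])

# One shared encoder fitted on both sets gives a single consistent mapping
shared = LabelEncoder().fit(pd.concat([train_col, test_col]))
shared_code = int(shared.transform(["OwnerOccupier"])[0])

print(code_in_train, code_in_test, shared_code)  # prints: 1 0 1
```

LabelEncoder numbers the sorted unique values it saw during fitting, which is why "OwnerOccupier" becomes 1 when "Investment" is present and 0 when it is not.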

## Addressing problems with NaN in the data
As we saw from our EDA, there are quite a lot of NaNs in the data. Our model won't know what to do with these, so we need to replace them with something sensible.

There are quite a few options we can use: the mean, the median, most_frequent, or a constant such as 0. Playing with these will give different results; for now I have it set to use the mean, which replaces each missing value with the mean of the column in which it is located.

In [None]:
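from sklearn.impute import SimpleImputer # preprocessing.Imputer was removed in scikit-learn 0.22

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x_train = imputer.fit_transform(x_train) # learn the column means from the training set
x_test = imputer.transform(x_test) # reuse the training means on the test set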
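
As a minimal sketch of mean imputation, the toy arrays below (hypothetical values) show column means being learned from one array and reused on another. It assumes a scikit-learn release that provides `sklearn.impute.SimpleImputer` (0.20 or later).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): two columns, some values missing
train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
test_demo = np.array([[np.nan, 50.0], [7.0, np.nan]])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
train_filled = imputer_demo.fit_transform(train_demo)  # learns column means 2.0 and 20.0
test_filled = imputer_demo.transform(test_demo)        # fills gaps with those training means

print(imputer_demo.statistics_)  # the learned column means
print(test_filled)               # rows become [2, 50] and [7, 20]
```

Calling `transform` rather than `fit_transform` on the second array keeps both sets filled with the same values, and avoids a shape mismatch if a column happens to be entirely missing in only one of them.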

## The three-step process below is common across many sklearn models

**First** we create a variable "Model" holding the model we want to fit. We are dealing with a regression problem and want to fit a Random Forest, so we choose RandomForestRegressor.

The value 3 passed below is the number of trees we would like in our forest (the n_estimators parameter). The default was 10 in older scikit-learn releases and is 100 from version 0.22 onwards; I have chosen 3 here for speed.

The **second** step is to train the model. We do this with our x and y training data. Remember that y_train is just the target we would like to predict, in this instance the price, price_doc. The x_train data is the information we are going to use to make that prediction.

**Thirdly**, once we have fitted the model we can use it to make a prediction. We do this by calling Model.predict. We are looking to predict the house prices for our test data, so we pass the test data to the predict method and assign the result to y_predict. This will contain our predicted set of house prices.

In [None]:
Model = RandomForestRegressor(n_estimators=3) # 3 trees, chosen for speed
Model.fit(x_train, y_train)
y_predict = Model.predict(x_test)
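
The same three steps can be tried end to end on synthetic data. This is only a sketch: the random arrays, the `demo_` names, and the fixed `random_state` are assumptions made so the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for x_train, y_train and x_test (assumptions for illustration)
rng = np.random.default_rng(0)
X_demo_train = rng.normal(size=(100, 4))
y_demo_train = 3.0 * X_demo_train[:, 0] + rng.normal(scale=0.1, size=100)
X_demo_test = rng.normal(size=(5, 4))

demo_model = RandomForestRegressor(n_estimators=3, random_state=0)  # step 1: choose the model
demo_model.fit(X_demo_train, y_demo_train)                          # step 2: train it
demo_pred = demo_model.predict(X_demo_test)                         # step 3: predict

print(demo_pred.shape)  # one prediction per test row
```

Other sklearn estimators follow the same create, fit, predict pattern, so swapping in a different regressor usually only changes the first line.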

## Output the data to CSV for submission
Finally we take the id_test vector we created earlier and combine it with our predictions in y_predict to create our CSV for output.

We use a pandas DataFrame to do this; its "to_csv" method writes our file out.

In [None]:
output = pd.DataFrame({'id': id_test, 'price_doc': y_predict})

output.to_csv('RandomForest.csv', index=False)
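
To check what the written file looks like without touching the disk, a small sketch can write to an in-memory buffer instead. The ids and prices below are made-up values used only for illustration.

```python
import io
import pandas as pd

# Made-up ids and predictions standing in for id_test and y_predict
demo_ids = pd.Series([101, 102, 103], name="id")
demo_prices = [5.1e6, 7.3e6, 4.8e6]
demo_output = pd.DataFrame({"id": demo_ids, "price_doc": demo_prices})

buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
header = buffer.getvalue().splitlines()[0]

print(header)  # prints: id,price_doc
```

Without `index=False`, pandas would add an unnamed leading column holding the row index, so the file would no longer contain just the id and price_doc columns.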