# Ready for your first Kaggle competition?

Kaggle is a popular platform that hosts machine learning competitions.

The platform helps users to interact via forums and shared code, fostering both collaboration and competition.

1. Go to the Kaggle competition [website](https://www.kaggle.com/competitions).
2. Register for an account (it's free).
3. Find the __House Prices - Advanced Regression Techniques__ competition.
4. Go to the Data tab, read the description, download the data.

## Inspect the data

Use the pandas Python package to read the csv files and inspect the data:
* How many examples?
* How many features?
* Are there non-numerical values? If so, how do you handle them?
* Are there NaNs? If so, how do you handle them?
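
The questions above can be answered with a few pandas calls. A minimal sketch on a small made-up frame (the toy values are an assumption for illustration; substitute the DataFrame read from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data (hypothetical values, similar column style).
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "MSZoning": ["RL", "RL", "RM", None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

n_examples, n_columns = toy.shape                  # number of rows, columns
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
nan_counts = toy.isna().sum()                      # NaN count per column
cols_with_nan = nan_counts[nan_counts > 0].index.tolist()

print(f"{n_examples} examples, {n_columns} columns")
print("non-numeric columns:", non_numeric)
print("columns with NaNs:", cols_with_nan)
```

Note that the column count includes `Id` and the label `SalePrice`, so the number of usable features is two less than the number of columns.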

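In [8]:
import pandas as pd
import torch
# DataModule, LinearRegression and Trainer live in the d2l.torch submodule
from d2l import torch as d2l

# Write your code here to read the data

df = pd.read_csv("train.csv")
df_train = df.sample(frac=0.7)    # random 70% of the rows for training
df_val = df.drop(df_train.index)  # remaining 30% for validation
df_test = pd.read_csv("test.csv")
print(df_val)                     # truncated view of the validation split

        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \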
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   
8        9          50       RM         51.0     6120   Pave   NaN      Reg   
15      16          45       RM         51.0     6120   Pave   NaN      Reg   
18      19          20       RL         66.0    13695   Pave   NaN      Reg   
...    ...         ...      ...          ...      ...    ...   ...      ...   
1432  1433          30       RL         60.0    10800   Pave  Grvl      Reg   
1447  1448          60       RL         80.0    10000   Pave   NaN      Reg   
1448  1449          50       RL         70.0    11767   Pave   NaN      Reg   
1451  1452          20       RL         78.0     9262   Pave   NaN      Reg   
1455  1456          60       RL         62.0     7917   Pave   NaN      Reg   

     LandContour Util

## Class to load the Training, Validation and Test sets.

In [9]:
class KaggleHouse(d2l.DataModule):
    def __init__(self, batch_size, train=None, val=None):
        super().__init__()
        self.save_hyperparameters()
        if self.train is None:
            # Use the DataFrames read from the csv files above
            self.raw_train = df_train
            self.raw_val = df_val
            self.raw_test = df_test

    def preprocess(self):
        """All the things you noticed about the data that need preprocessing
           can be addressed here.
        """
        label_col_name = "SalePrice"
        n_train, n_val = len(self.raw_train), len(self.raw_val)

        # Drop the label and 'Id' columns from the train and val sets, and the
        # 'Id' column from the test set. Stack the three sets so that they are
        # preprocessed identically and end up with the same columns.
        features = pd.concat([
            self.raw_train.drop(columns=['Id', label_col_name]),
            self.raw_val.drop(columns=['Id', label_col_name]),
            self.raw_test.drop(columns=['Id']),
        ])

        numeric_cols = features.select_dtypes(include='number').columns
        # Assumption (not in the original exercise): standardize numerical
        # features so that gradient descent stays stable on raw-scale columns
        features[numeric_cols] = features[numeric_cols].apply(
            lambda x: (x - x.mean()) / x.std())
        # Replace NaN numerical features by 0
        features[numeric_cols] = features[numeric_cols].fillna(0)

        # Replace discrete features by one-hot encoding
        features = pd.get_dummies(features, dtype=float)

        # Save preprocessed features (separate train, validation and test sets)
        # and put the labels back on the train and validation sets
        self.train = features.iloc[:n_train].copy()
        self.train[label_col_name] = self.raw_train[label_col_name]
        self.val = features.iloc[n_train:n_train + n_val].copy()
        self.val[label_col_name] = self.raw_val[label_col_name]
        self.test = features.iloc[n_train + n_val:].copy()

    def get_dataloader(self, train):
        label_col_name = "SalePrice"
        data = self.train if train else self.val

        # Features tensor: every column except the label, as float32
        features = torch.tensor(
            data.drop(columns=[label_col_name]).values.astype(float),
            dtype=torch.float32)
        # Labels tensor: logarithm of the prices, same dtype as the features,
        # reshaped to a column, i.e. (-1, 1)
        labels = torch.log(torch.tensor(
            data[label_col_name].values.astype(float),
            dtype=torch.float32)).reshape(-1, 1)
        tensors = (features, labels)
        print(tensors[0].size(), tensors[1].size())
        return self.get_tensorloader(tensors, train)

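Note: with a plain `import d2l`, this cell fails with `AttributeError: module 'd2l' has no attribute 'DataModule'`. The PyTorch classes live in the `d2l.torch` submodule, so the package needs to be imported as `from d2l import torch as d2l`.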
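
When one-hot encoding the train, validation and test sets, the resulting columns need to match across the splits, otherwise a model fitted on one split cannot be applied to another. A minimal sketch on toy data (the values are made up for illustration), showing how separate `pd.get_dummies` calls can disagree and how encoding a concatenated frame avoids that:

```python
import pandas as pd

# Toy splits (hypothetical values): "RM" appears only in train, "FV" only in test.
train = pd.DataFrame({"MSZoning": ["RL", "RM"], "LotArea": [9600, 6120]})
test = pd.DataFrame({"MSZoning": ["RL", "FV"], "LotArea": [10000, 7917]})

# Encoding each split separately produces different column sets.
sep_train = pd.get_dummies(train, dtype=float)
sep_test = pd.get_dummies(test, dtype=float)
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding the concatenated frame, then slicing it back, keeps columns identical.
both = pd.get_dummies(pd.concat([train, test], ignore_index=True), dtype=float)
joint_train = both.iloc[:len(train)]
joint_test = both.iloc[len(train):]
print(list(joint_train.columns) == list(joint_test.columns))
```

Categories that are absent from a split simply show up as all-zero columns there, which is harmless for a linear model.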

In [None]:
data = KaggleHouse(batch_size=64)


In [None]:
# test that the data is prepared (add prints to the preprocess method to see what it does)
data.preprocess()


In [None]:
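# test the data loader: if this works it prints the sizes of the features and labels tensors,
# e.g. torch.Size([1460, 330]) torch.Size([1460, 1]) when all 1460 training rows are used
# (the row count is smaller after a 70/30 split, and the feature count depends on your encoding)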
data.get_dataloader(train=True)


## Training

In [None]:
# This function is complete: if get_dataloader is implemented correctly, this should work
def your_training(trainer, data, lr=0.01):
    # Get the training dataloader
    train_loader = data.get_dataloader(train=True)

    model = d2l.LinearRegression(lr) # Initialize the model
    model.board.yscale='log'         # plot the loss on a log scale

    trainer.fit(model, data)         # fit model to data

    return model                     # return the model

In [None]:
# define the trainer (we can use the built-in d2l.Trainer)
trainer = d2l.Trainer(max_epochs=20)
your_model = your_training(trainer, data, lr=0.01)

## Evaluate your model on the Test set

In [None]:
your_predictions = your_model(torch.tensor(data.test.values.astype(float), dtype=torch.float32))

# NOTE: we trained the model to predict the log of the labels, so we need to exponentiate the predictions
preds_exp = torch.exp(your_predictions)
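
Because the model is trained on log prices, its outputs have to be mapped back with the exponential before they are submitted. The sketch below uses NumPy arrays with made-up numbers (an assumption; the notebook itself works with torch tensors, where `torch.log` and `torch.exp` play the same role). It also computes the RMSE between log prices, which is the metric described on the competition's Evaluation tab; check that page to confirm.

```python
import numpy as np

# Hypothetical true prices and model outputs in log space.
true_prices = np.array([208500.0, 181500.0, 223500.0])
log_preds = np.array([12.20, 12.15, 12.30])

# Map the predictions back to the price scale.
preds_exp = np.exp(log_preds)

# RMSE between the log predictions and the log of the true prices.
log_rmse = np.sqrt(np.mean((log_preds - np.log(true_prices)) ** 2))

print(preds_exp.round(0))
print(f"log RMSE: {log_rmse:.4f}")
```

Computing this on the validation set gives a rough preview of the leaderboard score before submitting.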

## Now save your predictions in a csv file

Read the required submission format carefully and create the csv file accordingly.

Kaggle expects two comma-separated columns, 'Id' and 'SalePrice'.
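
Before uploading, it can help to check the file against that format. A minimal sketch, where `check_submission` is a hypothetical helper (not part of pandas or Kaggle) and the Ids and prices are made up:

```python
import io
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids) -> None:
    """Raise AssertionError if the frame does not match the expected format."""
    assert list(submission.columns) == ["Id", "SalePrice"], "wrong columns"
    assert submission["Id"].tolist() == list(expected_ids), "Ids do not match"
    assert submission["SalePrice"].notna().all(), "NaN predictions"
    assert (submission["SalePrice"] > 0).all(), "non-positive prices"

# Hypothetical test Ids and predicted prices.
submission = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [169000.5, 187700.0, 183500.25]})
check_submission(submission, expected_ids=[1461, 1462, 1463])

# Write to an in-memory buffer to see exactly what the csv will contain.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: without it pandas writes the row index as an extra unnamed first column, which does not match the two-column format.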

In [None]:
submission = pd.DataFrame({'Id': data.raw_test['Id'],
                           'SalePrice': preds_exp.detach().numpy().reshape(-1)
                           })

submission.to_csv('kaggle_prices/my_submission.csv', index=False)

## Submit your predictions to the Kaggle competition and see your score!