# TampereBNB Listings - Price Prediction
<p> Get ready for an exhilarating data science adventure! In this exciting assignment, you will dive into the world of TampereBNB, the popular platform for short-term accommodation rentals. 
    <br>
    Your mission? To analyze data from this platform and use your data science skills to predict missing prices for some of the listings, using the tools mentioned in the following cell. </p>
<br>

## Instructions
- Train a regression model of your choice to predict listing prices from the training data.
- Use the trained model to predict prices for the listings in the testing data.
- Store the resulting dataframe as a pickle file (`out.pkl`).

**NOTE: The code snippets for loading the data files and outputting the resulting dataframe are provided. Do not modify them.**


#### Accessing the dataset
To facilitate your work, we have created two separate TampereBNB CSV files, one for training and one for testing, located in the `data/` folder. Make sure the file paths are unchanged before submitting your solution.

#### TODO

- You are expected to predict prices for the listings in the testing data using the following libraries (in addition to the built-in Python modules; other libraries can be included upon request):
    - scikit-learn (sklearn)
    - pandas
    - numpy
    
- Store your predictions as a dataframe with the attribute `Hinta` (case sensitive).
- Save the dataframe in a pickle file, `out.pkl` (case sensitive).
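
As a rough illustration of the expected output format, the sketch below writes a small dataframe with a `Hinta` column to a pickle file and reads it back. The dataframe `demo_df`, its values, and the temporary output path are hypothetical stand-ins chosen so that the real `out.pkl` is not overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the testing dataframe; the real one is loaded from the CSV file.
demo_df = pd.DataFrame({"m2": [30.0, 55.5, 72.0]})

# Predictions go into a column named exactly 'Hinta' (the name is case sensitive).
demo_df["Hinta"] = [95000, 150000, 210000]  # made-up values, not real predictions

# Write to a temporary directory so an existing out.pkl is left untouched.
out_path = os.path.join(tempfile.mkdtemp(), "out.pkl")
demo_df.to_pickle(out_path)

# Read the file back to confirm the column survived the round trip.
restored = pd.read_pickle(out_path)
print(restored.columns.tolist())
```

Reading the file back with `pd.read_pickle` is a cheap way to confirm that the column name and row count are what you intended before submitting.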



## Importing the required packages

In [1]:
import sklearn
import numpy as np
import pandas as pd 

## Loading Data

In [2]:
# !MAKE SURE TO NOT CHANGE THE CODE WITHIN THIS CELL!. 
# Instead, put the data files within a folder named 'data' such that the paths would work.
training_df = pd.read_csv('./data/Tampere_BNB_training_listing.csv')
testing_df = pd.read_csv('./data/Tampere_BNB_testing_listing.csv')


## Data Cleaning (optional)

In [3]:
# Clean the data for further processing

In [4]:
def clean_listings(df):
    """Normalise text columns, fill missing values and drop rows with an impossible floor."""
    df = df.copy()
    df['Huoneisto'] = df['Huoneisto'].str.lower()
    df['Krs'] = df['Krs'].fillna('1/1')
    df['Hissi'] = df['Hissi'].fillna('ei')
    df['Kunto'] = df['Kunto'].fillna('tyyd.')
    # 'Krs' is read as 'floor/total floors'; drop rows where the floor exceeds the total.
    floor_above_total = df['Krs'].apply(lambda x: int(x.split('/')[0]) > int(x.split('/')[1]))
    df = df[~floor_above_total].copy()
    # Strip minus signs (e.g. basement floors) so the values are uniform categories.
    df['Krs'] = df['Krs'].str.replace('-', '')
    return df

# Apply the same cleaning steps to both datasets.
training_df = clean_listings(training_df)
testing_df = clean_listings(testing_df)



## Feature Engineering (optional)

In [5]:
# Feature engineering/extraction/discovery to extract new features from the raw data.
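
One possible feature-engineering step, sketched under the assumption that `Krs` holds strings of the form `floor/total floors`, is to split that column into numeric features. The dataframe `demo` and the new column names `floor`, `total_floors`, and `is_top_floor` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Hypothetical sample values, assumed to follow the 'floor/total floors' pattern.
demo = pd.DataFrame({"Krs": ["2/5", "1/1", "4/4"]})

# Split on '/' into two columns and convert both to integers.
floors = demo["Krs"].str.split("/", expand=True).astype(int)
demo["floor"] = floors[0]
demo["total_floors"] = floors[1]

# Flag listings on the top floor of their building.
demo["is_top_floor"] = demo["floor"] == demo["total_floors"]
print(demo)
```

Numeric floor features let a linear model use the ordering of floors, which one-hot encoding each distinct `Krs` string does not.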

## Data Modeling

In [6]:
# Implement your prediction solution here.
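
Before predicting on the testing data, it may help to estimate how well a model generalises. The sketch below runs 5-fold cross-validation on synthetic data. The arrays `X_demo` and `y_demo`, the linear price relationship, and the choice of mean absolute error are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: price grows linearly with floor area, plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(20, 120, size=100)
X_demo = area.reshape(-1, 1)
y_demo = 2500 * area + rng.normal(0, 5000, size=100)

# Shuffle before splitting so each fold sees a mix of small and large listings.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports errors as negative scores, so flip the sign afterwards.
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
)
mae_per_fold = -scores
print(mae_per_fold.mean())
```

With the real data, `X_demo` and `y_demo` would be replaced by the encoded training features and the `Hinta` column.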

In [7]:
# Make sure to store your predictions in the 'Hinta' attribute of the testing dataframe.

# By default, the following assignment will initialize the 'Hinta' column with the given constant.
testing_df["Hinta"] = 10000

In [8]:
# pandas (pd) and numpy (np) are already imported in the first cell.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Combine training_df and testing_df into a single dataframe
combined_df = pd.concat([training_df, testing_df], axis=0)

# Reset the index to ensure uniqueness
combined_df = combined_df.reset_index(drop=True)

# One-hot encode the categorical columns
cat_cols = ['Kaupunginosa', 'Huoneisto', 'Talot.', 'Krs', 'Hissi', 'Kunto', 'Asunnon tyyppi']
# `sparse_output` requires scikit-learn >= 1.2; older releases use `sparse=False` instead
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
combined_df_cat = encoder.fit_transform(combined_df[cat_cols])
combined_df_cat = pd.DataFrame(combined_df_cat, columns=encoder.get_feature_names_out(cat_cols))

# Concatenate the encoded categorical columns with the remaining original columns
combined_df = pd.concat([combined_df.drop(cat_cols, axis=1), combined_df_cat], axis=1)

# Split back into training and test data
train_data = combined_df[:len(training_df)]
test_data = combined_df[len(training_df):]

# Define the feature and target columns
features = [col for col in train_data.columns if col != 'Hinta']
target = 'Hinta'

# Create the LinearRegression model
model = LinearRegression()

# Train the LinearRegression model
model.fit(train_data[features], train_data[target])

# Make predictions
predictions = model.predict(test_data[features])

# Add the predicted prices to the original testing_df dataframe
testing_df['Hinta'] = predictions.round(0).astype(np.int64)

# Save a copy to out.csv
testing_df.to_csv('out.csv', index=False)

print(testing_df.columns)

Index(['Kaupunginosa', 'Huoneisto', 'Talot.', 'm2', 'Rv', 'Krs', 'Hissi',
       'Kunto', 'Asunnon tyyppi', 'Pituusaste', 'Leveysaste', 'Hinta'],
      dtype='object')
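
The cell above fits the encoder on the combined training and testing rows. An alternative, sketched here rather than taken from the original solution, is to fit the encoder on the training rows only inside a scikit-learn `Pipeline`. The toy dataframes, the category values other than `tyyd.` and `ei`, and the names `toy_pipeline` and `toy_cat_cols` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data that only mimics a few of the listing columns.
toy_train = pd.DataFrame({
    "Kunto": ["hyvä", "tyyd.", "hyvä", "huono"],
    "Hissi": ["on", "ei", "ei", "on"],
    "m2": [30.0, 45.0, 60.0, 80.0],
    "Hinta": [100000, 120000, 180000, 200000],
})
toy_test = pd.DataFrame({
    "Kunto": ["hyvä", "uusi"],  # 'uusi' never appears in the training rows
    "Hissi": ["ei", "on"],
    "m2": [50.0, 70.0],
})

toy_cat_cols = ["Kunto", "Hissi"]

# Encode the categorical columns and pass the numeric ones through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat_cols)],
    remainder="passthrough",
)
toy_pipeline = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# The encoder learns its categories from the training rows only.
toy_pipeline.fit(toy_train.drop(columns="Hinta"), toy_train["Hinta"])

# Unseen categories are encoded as all zeros instead of raising an error.
toy_predictions = toy_pipeline.predict(toy_test)
print(toy_predictions)
```

Because `handle_unknown="ignore"` already copes with categories that appear only in the testing data, concatenating the two dataframes before encoding is not strictly required.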




## Store the results

In [9]:
# !MAKE SURE TO NOT CHANGE THE CODE WITHIN THIS CELL!. 
testing_df.to_pickle("out.pkl")