# TampereBNB Listings - Price Prediction
<p> Get ready for an exhilarating data science adventure! In this exciting assignment, you will dive into the world of TampereBNB, the popular platform for short-term accommodation rentals. 
    <br>
    Your mission? To analyze data from this platform and use your data science skills to predict missing prices for some of the listings, using the tools mentioned in the following cell. </p>
<br>

## Instructions
- Train a regression model of your choice to predict listing prices from the training data.
- Use the trained model to get the price predictions for the listings in the testing data.
- Store the resulting dataframe as a pickle file (`out.pkl`).

**NOTE: The code snippets for loading the data files and writing out the resulting dataframe are provided. Do not modify them.**


#### Accessing the dataset
To facilitate your work, we have created separate training and testing TampereBNB CSV files, located in the `data/` folder. Make sure the file paths are unchanged before submitting your solution.

#### TODO

- Predict prices for the listings in the testing data using the following libraries (in addition to Python's built-in modules; other libraries can be included upon request):
    - scikit-learn (sklearn)
    - pandas
    - numpy
    
- Store your predictions as a dataframe with the attribute `Hinta` (case sensitive).
- Save the dataframe in a pickle file, `out.pkl` (case sensitive).
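
As a minimal sketch of the expected output format, the snippet below builds a small dataframe with a `Hinta` column, pickles it, and reads it back. The price values and the temporary directory are made up for illustration; in the assignment itself, `out.pkl` is written by the provided final cell.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; real values come from your trained model.
example_df = pd.DataFrame({"Hinta": [120, 95, 210]})

# Write to a temporary directory so the sketch does not touch the real out.pkl.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.pkl")
    example_df.to_pickle(path)
    restored_df = pd.read_pickle(path)

# The column name is case sensitive: 'Hinta', not 'hinta'.
has_price_column = "Hinta" in restored_df.columns
```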



## Importing the required packages

In [25]:
import sklearn
import numpy as np
import pandas as pd
# geopy is not among the libraries listed in the TODO section; it is used only for distances.
from geopy.distance import geodesic

## Loading Data

In [26]:
# !MAKE SURE TO NOT CHANGE THE CODE WITHIN THIS CELL!. 
# Instead, put the data files within a folder named 'data' such that the paths would work.
training_df = pd.read_csv('./data/Tampere_BNB_training_listing.csv')
testing_df = pd.read_csv('./data/Tampere_BNB_testing_listing.csv')


In [27]:
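def label_highest_floor(row):
    # 'Krs' has the form 'floor/total'. A listing is on the top floor when both
    # parts match and the building has more than one floor.
    floor, total = row['Krs'].split('/')[:2]
    if floor == total and int(floor) > 1:
        return "on"
    return "ei"

def correct_floor(row):
    # Take the smaller of the two numbers so that swapped 'total/floor' entries
    # still yield the listing's floor.
    floor, total = row['Krs'].split('/')[:2]
    return min(int(total), int(floor))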

def huoneisto_string_cleaner(row):
    char_replacements = {
        " ": "",
        "+": ",",
        "x": "",
        ".": "",
        "(": "",
        ")": "",
        "/": ","
    }
    string = row['Huoneisto'].translate(str.maketrans(char_replacements))
    cleaned_string = string.replace("lasitettu", '').replace("lasit", '').replace("las", '')
    return cleaned_string

def label_sauna(row):
    huoneisto_tokens = row['Huoneisto'].lower().split(",")
    sauna_tokens = ['s', 'sauna']
    for token in huoneisto_tokens:
        if token in sauna_tokens:
            return "on"
        if "sauna" in token:
            return "on"
    return "ei"

def label_parveke(row):
    huoneisto_tokens = row['Huoneisto'].lower().split(",")
    parveke_tokens = ['p', 'parv', 'parveke']
    for token in huoneisto_tokens:
        if token in parveke_tokens:
            return "on"
        if "parv" in token:
            return "on"
    return "ei"

def label_kattoterassi(row):
    huoneisto_tokens = row['Huoneisto'].lower().split(",")
    for token in huoneisto_tokens:
        if "kattoterassi" in token:
            return "on"
    return "ei"

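def label_huoneiden_lkm(row):
    # Map the Finnish room-count word in 'Asunnon tyyppi' to an integer.
    string_to_int = {
        "yksiö": 1,
        "kaksi": 2,
        "kolme": 3,
        "neljä": 4
    }
    asunnon_tyyppi_tokens = row['Asunnon tyyppi'].lower().split()
    for token in asunnon_tyyppi_tokens:
        if token in string_to_int:
            return string_to_int[token]
    # No known room-count word found: the value becomes missing (NaN) in the
    # resulting column and must be handled before model fitting.
    return None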
        
def get_distance_from_trainstation(row):
    rautatieasema = (61.4986261803628, 23.775149167944846)
    current = (float(row['Leveysaste']), float(row['Pituusaste']))
    return geodesic(current, rautatieasema).km
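
Because the TODO section lists only scikit-learn, pandas, and numpy, a NumPy-only distance may be useful if `geopy` is unavailable. The sketch below uses the haversine (great-circle) formula, which is a spherical approximation and not the ellipsoidal geodesic that `geopy` computes, so results can differ slightly. The function name `haversine_km`, the radius constant, and the sample listing coordinates are assumptions made for illustration.

```python
import numpy as np

# Mean Earth radius in km; an assumption of the spherical model.
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Same station coordinates as in get_distance_from_trainstation.
STATION_LAT, STATION_LON = 61.4986261803628, 23.775149167944846

# Hypothetical listing coordinates, for illustration only.
lats = np.array([61.4986261803628, 61.45])
lons = np.array([23.775149167944846, 23.85])
distances_km = haversine_km(lats, lons, STATION_LAT, STATION_LON)
```

Since the function is vectorized, it could presumably be applied to whole `Leveysaste` and `Pituusaste` columns at once instead of row by row.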


## Data Cleaning (optional)

In [28]:
# Clean the data for further processing
testing_df['Huoneisto'] = testing_df['Huoneisto'].str.lower()
testing_df['Huoneisto'] = testing_df.apply(huoneisto_string_cleaner, axis=1)
testing_df['Krs'] = testing_df['Krs'].fillna('1/1')
testing_df['Krs'] = testing_df['Krs'].str.replace("-", '')
testing_df['Hissi'] = testing_df['Hissi'].fillna('ei')
testing_df['Kunto'] = testing_df['Kunto'].fillna('tyyd.')

training_df['Huoneisto'] = training_df['Huoneisto'].str.lower()
training_df['Huoneisto'] = training_df.apply(huoneisto_string_cleaner, axis=1)
training_df['Krs'] = training_df['Krs'].fillna('1/1')
training_df['Krs'] = training_df['Krs'].str.replace("-", '')
training_df['Hissi'] = training_df['Hissi'].fillna('ei')
training_df['Kunto'] = training_df['Kunto'].fillna('tyyd.')
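
The same cleaning steps are applied to both dataframes above, so they could be wrapped in one function to keep the two in sync. The following is a sketch under assumptions: `clean_listings` is a hypothetical helper, the sample rows are invented, and the vectorized string methods are intended to mirror `huoneisto_string_cleaner` without being a tested drop-in replacement on the real data.

```python
import pandas as pd

# Same character replacements as in huoneisto_string_cleaner.
CHAR_TABLE = str.maketrans(
    {" ": "", "+": ",", "x": "", ".": "", "(": "", ")": "", "/": ","}
)

def clean_listings(df):
    """Return a cleaned copy of a listings dataframe (hypothetical helper)."""
    cleaned = df.copy()
    huoneisto = cleaned["Huoneisto"].str.lower().str.translate(CHAR_TABLE)
    # Remove the 'glazed' qualifiers, longest variant first.
    for word in ("lasitettu", "lasit", "las"):
        huoneisto = huoneisto.str.replace(word, "", regex=False)
    cleaned["Huoneisto"] = huoneisto
    cleaned["Krs"] = cleaned["Krs"].fillna("1/1").str.replace("-", "", regex=False)
    cleaned["Hissi"] = cleaned["Hissi"].fillna("ei")
    cleaned["Kunto"] = cleaned["Kunto"].fillna("tyyd.")
    return cleaned

# Invented sample rows, for illustration only.
sample = pd.DataFrame({
    "Huoneisto": ["2H + K + S", "1h+kk+lasitettu parv."],
    "Krs": ["-1/3", None],
    "Hissi": [None, "on"],
    "Kunto": ["hyvä", None],
})
cleaned_sample = clean_listings(sample)
```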


## Feature Engineering (optional)

In [29]:
# Feature engineering: extract new features from the raw data.
testing_df['Krs'] = testing_df.apply(correct_floor, axis=1)
testing_df['Sauna'] = testing_df.apply(label_sauna, axis=1)
testing_df['Parveke'] = testing_df.apply(label_parveke, axis=1)
testing_df['Kattoterassi'] = testing_df.apply(label_kattoterassi, axis=1)
testing_df['Huoneiden lkm'] = testing_df.apply(label_huoneiden_lkm, axis=1)
testing_df['Etäisyys rautatieasemalta'] = testing_df.apply(get_distance_from_trainstation, axis=1)
testing_df = testing_df.drop(columns=['Huoneisto', 'Krs', 'Asunnon tyyppi', 'Pituusaste', 'Leveysaste'])

training_df['Krs'] = training_df.apply(correct_floor, axis=1)
training_df['Sauna'] = training_df.apply(label_sauna, axis=1)
training_df['Parveke'] = training_df.apply(label_parveke, axis=1)
training_df['Kattoterassi'] = training_df.apply(label_kattoterassi, axis=1)
training_df['Huoneiden lkm'] = training_df.apply(label_huoneiden_lkm, axis=1)
training_df['Etäisyys rautatieasemalta'] = training_df.apply(get_distance_from_trainstation, axis=1)
training_df = training_df.drop(columns=['Huoneisto', 'Krs', 'Asunnon tyyppi', 'Pituusaste', 'Leveysaste'])
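
In the cell above, the corrected `Krs` value is dropped before modeling, and `label_highest_floor` is defined but never applied. If floor information turns out to be useful, it could be kept as separate features. The sketch below shows one possible way to do that on invented data; the helper `floor_features` and the column names `Kerros` and `Ylin kerros` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def floor_features(krs):
    """Split 'floor/total' strings into a numeric floor and a top-floor flag."""
    parts = krs.str.split("/", expand=True).astype(int)
    # Mirrors correct_floor: the smaller number is taken as the listing's floor.
    floor = parts.min(axis=1)
    # Mirrors label_highest_floor: both parts equal and above the first floor.
    top_floor = (parts[0] == parts[1]) & (parts[0] > 1)
    return pd.DataFrame({
        "Kerros": floor,
        "Ylin kerros": top_floor.map({True: "on", False: "ei"}),
    })

# Invented 'Krs' values, for illustration only.
sample_krs = pd.Series(["3/3", "1/1", "2/5", "5/2"])
floor_df = floor_features(sample_krs)
```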

## Data Modeling

In [30]:
# Implement your prediction solution here.
# Make sure to store your predictions in the 'Hinta' attribute of the testing dataframe.
# By default, the following assignment will initialize the 'Hinta' column with the given constant.
testing_df["Hinta"] = 1000

In [31]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Combine training_df and testing_df into a single dataframe so both get the same one-hot columns
combined_df = pd.concat([training_df, testing_df], axis=0)

# Reset the index to ensure uniqueness
combined_df = combined_df.reset_index(drop=True)

# Process the data for one-hot encoding
cat_cols = ['Kaupunginosa', 'Talot.', 'Hissi', 'Kunto', 'Sauna', 'Parveke', 'Kattoterassi']
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
combined_df_cat = encoder.fit_transform(combined_df[cat_cols])
combined_df_cat = pd.DataFrame(combined_df_cat, columns=encoder.get_feature_names_out(cat_cols))

# Concatenate the encoded categorical columns with the remaining columns
combined_df = pd.concat([combined_df.drop(cat_cols, axis=1), combined_df_cat], axis=1)

# Separate training and test data again 
train_data = combined_df[:len(training_df)]
test_data = combined_df[len(training_df):]

# Define the feature and target columns
features = [col for col in train_data.columns if col != 'Hinta']
target = 'Hinta'

# Create a LinearRegression model
model = LinearRegression()

# Train the LinearRegression model
model.fit(train_data[features], train_data[target])

# Make predictions
predictions = model.predict(test_data[features])

# Add column with prediction to original testing_df 
testing_df['Hinta'] = predictions.round(0).astype(np.int64)

# Also save the predictions as out.csv for inspection (the required out.pkl is written in the next cell)
testing_df.to_csv('out.csv', index=False)
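
The model above is fitted on all training rows and never evaluated. One way to estimate its error before submitting is k-fold cross-validation on the training data. The sketch below is an illustration on synthetic data, not the notebook's actual pipeline: the toy column `m2`, the generated prices, and the choice of mean absolute error are all assumptions. Wrapping the encoder in a pipeline means it is refitted on the training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the training data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 60
toy_df = pd.DataFrame({
    "m2": rng.uniform(20, 100, n),
    "Hissi": rng.choice(["on", "ei"], n),
})
toy_df["Hinta"] = (
    2 * toy_df["m2"] + 10 * (toy_df["Hissi"] == "on") + rng.normal(0, 1, n)
)

# One-hot encode the categorical column; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Hissi"])],
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LinearRegression())

# 5-fold cross-validation; scikit-learn reports errors as negative scores.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    pipeline,
    toy_df[["m2", "Hissi"]],
    toy_df["Hinta"],
    cv=cv,
    scoring="neg_mean_absolute_error",
)
mean_abs_error = -scores.mean()
```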

## Store the results

In [32]:
# !MAKE SURE TO NOT CHANGE THE CODE WITHIN THIS CELL!. 
testing_df.to_pickle("out.pkl")