## Predicting 2021 rent in NYC using multiple regression

### Table of contents
- [Importing libraries and data](#import)
- [Cleaning and processing data](#clean)
- [Developing model(s)](#model)
- [How to improve the model](#improve)

### Importing libraries and data  <a name=import />

In [1]:
#importing libraries

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [2]:
#importing data

raw_df = pd.read_csv('NYC_Rental_data.csv')

In [3]:
raw_df.head()

| Unnamed: 0.1 | Unnamed: 0 | address | listing url | rent | bedrooms | bathrooms | sqft | rented on |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 Union Sq S #19C | https://streeteasy.com/rental/3235113?featured=1 | $5,725 | 1 Bed | 1 Bath | NaN | 02/05/2021 |
| 1 | 1 | 180 W 20th Street #3E | https://streeteasy.com/rental/3395206?featured=1 | $6,366 | 1 Bed | 1.5 Baths | NaN | 02/02/2021 |
| 2 | 2 | 450 W 42nd Street #21A | https://streeteasy.com/rental/3544410 | $5,200 | 1 Bed | 2 Baths | NaN | 05/30/2021 |
| 3 | 3 | 224 East 135th Street #2204 | https://streeteasy.com/rental/3544344 | $4,100 | 2 Beds | 1 Bath | 702.0 | 05/30/2021 |
| 4 | 4 | 224 East 135th Street #2201 | https://streeteasy.com/rental/3544340 | $4,650 | 2 Beds | 1 Bath | 989.0 | 05/30/2021 |


### Cleaning and processing the data <a name=clean />

In [4]:
#cleaning data

df = raw_df.drop('Unnamed: 0', axis =1)

In [5]:
#checking for unique listings

df['address'].nunique()

122

Of the 1300 rows, only 122 have a unique address.

The remaining 90.62% of the rows are redundant.
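
The redundancy figure follows directly from the two counts above. A minimal sketch of the arithmetic, with both counts hard-coded from the notebook output rather than recomputed from the CSV:

```python
# Counts taken from the notebook output above (hard-coded for illustration).
total_rows = 1300
unique_addresses = 122

# Share of rows that do not contribute a new address.
redundant_share = 1 - unique_addresses / total_rows
print(f"{redundant_share:.2%}")  # 90.62%
```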

In [6]:
# dropping duplicate rows

data = df.drop_duplicates()

In [7]:
# checking for null values

data.isnull().sum()

address         0
listing url     0
rent            0
bedrooms        0
bathrooms       0
sqft           70
rented on       0
dtype: int64

Since `sqft` is null in 70 of the 122 rows (57.38%), we will drop this feature.
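
The share of missing values per column can also be read off directly with `isnull().mean()`. The sketch below uses a small stand-in frame built from the five rows shown by `raw_df.head()`, not the full dataset; the name `toy` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the rent and sqft values of the five rows shown earlier.
toy = pd.DataFrame({
    "rent": [5725, 6366, 5200, 4100, 4650],
    "sqft": [np.nan, np.nan, np.nan, 702.0, 989.0],
})

# isnull().mean() gives the fraction of missing values in each column.
null_share = toy.isnull().mean()
print(null_share["sqft"])  # 0.6

# The same arithmetic for the counts reported above: 70 missing out of 122 rows.
print(round(70 / 122 * 100, 2))  # 57.38
```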

In [8]:
data = data.drop('sqft', axis = 1)

At this stage the `rent`, `bedrooms`, and `bathrooms` columns are still stored as strings (for example `$5,725` or `1 Bed`) and cannot be used for regression. We will now convert them to numbers.

In [9]:
data.columns

Index(['address', 'listing url', 'rent', 'bedrooms', 'bathrooms', 'rented on'], dtype='object')

In [10]:
# regex=False: '$' is an end-of-string anchor when treated as a regex, so match it literally
data['rent'] = data['rent'].str.replace('$','', regex=False).str.replace(',','',regex=False).astype('int')

In [11]:
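# 'Studio' counts as 0 bedrooms; strip the 'Bed'/'Beds' suffix and convert to integers
data['bedrooms'] = data['bedrooms'].str.replace('Studio','0', regex=False).str.replace('Beds','',regex=False).str.replace('Bed','',regex=False).str.strip().astype('int')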

In [12]:
data['bathrooms'] = data['bathrooms'].str.replace('Baths','',regex=True).str.replace('Bath','',regex=True).astype('float')
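
The three cleaning steps above can also be gathered into one helper. This is a sketch of an alternative, not the notebook's own method: it extracts the numeric part with a regular expression instead of removing suffixes one by one. The function name `clean_listing_columns` and the `sample` frame are assumptions made for illustration.

```python
import pandas as pd

# Sample values in the same format as the `raw_df.head()` output; "Studio" is
# included because the notebook's cleaning code handles it.
sample = pd.DataFrame({
    "rent": ["$5,725", "$6,366", "$4,100"],
    "bedrooms": ["1 Bed", "Studio", "2 Beds"],
    "bathrooms": ["1 Bath", "1.5 Baths", "1 Bath"],
})

def clean_listing_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with rent, bedrooms and bathrooms as numbers."""
    out = frame.copy()
    # Remove every character that is not a digit, e.g. "$5,725" -> "5725".
    out["rent"] = out["rent"].str.replace(r"[^\d]", "", regex=True).astype(int)
    # Map "Studio" to "0", then keep the leading integer of "1 Bed" / "2 Beds".
    beds = out["bedrooms"].str.replace("Studio", "0", regex=False)
    out["bedrooms"] = beds.str.extract(r"(\d+)", expand=False).astype(int)
    # Keep the leading decimal number of "1 Bath" / "1.5 Baths".
    baths = out["bathrooms"].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    out["bathrooms"] = baths.astype(float)
    return out

cleaned = clean_listing_columns(sample)
print(cleaned)
```

Working on a copy inside the helper also sidesteps the chained-assignment warnings that pandas can raise when columns of a derived frame are overwritten in place.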

### Developing the model(s) <a name=model />

For our analysis, we will use the variables `bedrooms` and `bathrooms` to predict `rent`.

In [13]:
X = data[['bedrooms','bathrooms']]
y = data['rent']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.1,random_state = 10)

In [15]:
clf = linear_model.LinearRegression()

In [16]:
clf.fit(X_train,y_train)

LinearRegression()

In [17]:
clf.score(X_test,y_test)

0.6492540978714583

The R-squared value of our model on the test set is **64.93%**.
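
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below computes it by hand on made-up numbers (not the NYC data) and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up rents and predictions, used only to illustrate the formula.
y_true = np.array([3000.0, 4000.0, 5000.0, 6000.0])
y_pred = np.array([3200.0, 3900.0, 4800.0, 6300.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                 # approximately 0.964
print(r2_score(y_true, y_pred))  # same value from scikit-learn
```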

To improve the model, we need to collect more data. 90.62% of the rows were lost to redundancy, and an important feature, `sqft`, had to be dropped because 57.38% of its values were null.

Next, we fill the missing values in the `sqft` column with the median and check whether this improves the model.

In [18]:
data2 = df[['rent','bedrooms','bathrooms','sqft']]

In [19]:
data2 = data2.drop_duplicates()

Processing the data to make the columns numerical

In [20]:
# regex=False: '$' is an end-of-string anchor when treated as a regex, so match it literally
data2['rent'] = data2['rent'].str.replace('$','', regex=False).str.replace(',','',regex=False).astype('int')

In [21]:
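# 'Studio' counts as 0 bedrooms; strip the 'Bed'/'Beds' suffix and convert to integers
data2['bedrooms'] = data2['bedrooms'].str.replace('Studio','0', regex=False).str.replace('Beds','',regex=False).str.replace('Bed','',regex=False).str.strip().astype('int')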

In [22]:
data2['bathrooms'] = data2['bathrooms'].str.replace('Baths','',regex=True).str.replace('Bath','',regex=True).astype('float')

In [23]:
data2['sqft'] = pd.to_numeric(data2['sqft'], errors='coerce')

In [24]:
data2['sqft'] = data2['sqft'].fillna(data2['sqft'].median())

Splitting the training and testing set and fitting the model

In [25]:
X2 = data2[['bedrooms', 'bathrooms', 'sqft']]
y2 = data2['rent']

In [26]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y2,test_size = 0.1,random_state = 10)

In [27]:
clf2 = linear_model.LinearRegression()

In [28]:
clf2.fit(X2_train,y2_train)

LinearRegression()

In [29]:
clf2.score(X2_test,y2_test)

0.6437533473048884

Filling in the missing `sqft` values had little effect on the model. The R-squared value dropped slightly, from 64.93% to **64.38%**. This is likely because about 57% of the `sqft` values were missing and were all replaced with the same median, so the column adds little information.
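
With a 10% hold-out on a dataset this small, the test set holds only about a dozen listings, so a difference of half a percentage point in R-squared may simply be noise. One way to get a steadier comparison is to put the median imputation inside a pipeline and score it with cross-validation. The sketch below runs on synthetic stand-in data, not the NYC listings, and every name and coefficient in it is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (NOT the NYC listings): 120 rows, ~57% of sqft missing.
rng = np.random.default_rng(0)
n = 120
bedrooms = rng.integers(0, 4, size=n)
bathrooms = rng.integers(1, 3, size=n).astype(float)
sqft = 400 + 250 * bedrooms + rng.normal(0, 50, size=n)
rent = 2000 + 900 * bedrooms + 500 * bathrooms + rng.normal(0, 300, size=n)
sqft[rng.random(n) < 0.57] = np.nan

X_syn = np.column_stack([bedrooms, bathrooms, sqft])

# The imputer is fitted on each training fold only, so the median never
# sees the held-out rows.
pipeline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
scores = cross_val_score(pipeline, X_syn, rent, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

The spread of the five fold scores gives a rough sense of how much a single train/test split can move the reported R-squared.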

In [30]:
clf2.predict([[1,1,700]])

array([3202.30096929])

According to our model, a 700 sqft apartment with 1 bedroom and 1 bathroom will rent for roughly $3,200 per month.
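
A linear regression prediction is the intercept plus the weighted sum of the features. The sketch below shows this on a toy model fitted to made-up numbers; the coefficients of `clf2` are not reproduced here, and `X_toy`, `y_toy`, and `toy_model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data (illustrative only): columns are bedrooms, bathrooms, sqft.
X_toy = np.array([
    [0, 1.0, 450],
    [1, 1.0, 700],
    [2, 1.0, 900],
    [2, 2.0, 1100],
    [3, 2.0, 1400],
])
y_toy = np.array([2500, 3200, 4100, 4800, 6000])

toy_model = LinearRegression().fit(X_toy, y_toy)

# intercept + coefficients . features reproduces predict()
x_new = np.array([[1, 1.0, 700]])
by_hand = toy_model.intercept_ + x_new @ toy_model.coef_
print(by_hand, toy_model.predict(x_new))  # the two values agree
```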

### How to improve the model <a name=improve />

- The `address` column can be used to look up the zipcode of each apartment, which can then be added as a feature. This should improve the accuracy, since rents in NYC vary widely between zipcodes.

- About 90% of the data was redundant. Collecting more unique listings would help improve the accuracy of the model.

- Collecting more features, such as the type of building, the year of construction, or the distance to the nearest subway station, would make the dataset more granular and may improve the accuracy of the model.
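
As a sketch of the first suggestion, a zipcode could enter the regression as one-hot encoded columns. The `zipcode` column below is hypothetical: it does not exist in the dataset and would have to be derived from `address`, for example with a geocoding service. The listings and rents are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical listings with a made-up `zipcode` column.
listings = pd.DataFrame({
    "bedrooms": [1, 2, 0, 1, 2, 3],
    "bathrooms": [1.0, 1.0, 1.0, 1.5, 2.0, 2.0],
    "zipcode": ["10003", "10454", "10003", "10011", "10454", "10011"],
    "rent": [5700, 4100, 3900, 6400, 4650, 9000],
})

# One-hot encode zipcode; drop_first avoids a redundant dummy column.
features = pd.get_dummies(
    listings[["bedrooms", "bathrooms", "zipcode"]],
    columns=["zipcode"],
    drop_first=True,
    dtype=float,
)
zip_model = LinearRegression().fit(features, listings["rent"])
print(features.columns.tolist())
```

With many distinct zipcodes and only about a hundred listings, this encoding adds many columns, so grouping zipcodes by neighborhood or borough may be a more practical starting point.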