# Intro to Machine Learning

- Machine Learning is a subfield of Artificial Intelligence concerned with developing algorithms and statistical models that enable computers to learn from data without being explicitly programmed.
- "Without being explicitly programmed" means that we do not give the computer a set of rules or instructions for a specific task. Instead, we provide a large amount of data and let the computer learn the patterns and relationships within that data on its own.
- The goal of Machine Learning is to enable computers to automatically improve their performance on a specific task by learning from data.


## Categories of Machine Learning

Machine Learning can be divided into three main categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

- Supervised Learning involves training a model on labeled data, where the input data has a corresponding output or target variable. The goal of Supervised Learning is to learn a mapping function from the input variables to the output variable.
- Unsupervised Learning involves training a model on unlabeled data, where the input data does not have a corresponding output or target variable. The goal of Unsupervised Learning is to find patterns or structure in the data.
- Reinforcement Learning involves training a model to make decisions based on feedback from the environment.

Machine Learning is used in a wide range of applications, including image and speech recognition, natural language processing, recommendation systems, and predictive analytics.
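
The difference between supervised and unsupervised learning can be sketched on toy data. The numbers below are invented for illustration and are not taken from the dataset used later: `LinearRegression` is given both inputs and targets, while `KMeans` is given only the inputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Toy data (invented): six points that lie exactly on the line y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = 2 * X.ravel() + 1

# Supervised: the model sees inputs AND targets and learns the mapping.
reg = LinearRegression().fit(X, y)
slope, intercept = reg.coef_[0], reg.intercept_

# Unsupervised: the model sees only the inputs and looks for structure,
# here two groups of nearby points.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

print("learned slope and intercept:", slope, intercept)
print("cluster label per point:", labels)
```

The cluster ids themselves are arbitrary; only the grouping of points is meaningful.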

## Popular algorithms

- Supervised Learning - Linear Regression, Logistic Regression, Decision Trees, Random Forests, and Neural Networks. Examples of problems that can be solved using Supervised Learning include predicting housing prices, classifying emails as spam or not spam, and recognizing handwritten digits.
- Unsupervised Learning - Clustering, Principal Component Analysis (PCA), and Association Rule Mining. Examples of problems that can be solved using Unsupervised Learning include grouping customers based on their purchasing behavior, identifying topics in a large collection of documents, and detecting anomalies in network traffic.

### Importing libraries

In [54]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

### Loading dataset

In [48]:
df = pd.read_csv("data/House_Rent_Dataset.csv")

### Feature selection

Feature selection is the process of choosing a subset of relevant features (or variables) from the larger set of features in a dataset. The goal is to improve the performance of a machine learning model by reducing the number of features it uses. This can help to reduce overfitting, improve model interpretability, and lower the computational cost of training.
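
In this notebook the features are selected by hand, by dropping columns. As a contrast, here is a minimal sketch of automated, score-based selection using scikit-learn's `SelectKBest` on synthetic data. The data and the choice of `k=1` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (invented): one informative column and two noise columns.
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X = np.column_stack([informative, noise_a, noise_b])
y = 3.0 * informative + rng.normal(scale=0.1, size=n)

# Score each feature against the target and keep the single best one.
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
kept = selector.get_support(indices=True)  # indices of the retained columns
X_reduced = selector.transform(X)

print("kept column indices:", kept)
print("shape before and after:", X.shape, X_reduced.shape)
```

Univariate scores like this look at each feature in isolation, so they can miss features that only matter in combination with others.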

#### Attribute description

- BHK: Number of Bedrooms, Hall, Kitchen.
- Rent: Rent of the Houses/Apartments/Flats.
- Size: Size of the Houses/Apartments/Flats in Square Feet.
- Floor: Floor on which the House/Apartment/Flat is situated and the Total Number of Floors (Example: Ground out of 2, 3 out of 5, etc.)
- Area Type: Size of the Houses/Apartments/Flats calculated on either Super Area or Carpet Area or Built Area.
- Area Locality: Locality of the Houses/Apartments/Flats.
- City: City where the Houses/Apartments/Flats are Located.
- Furnishing Status: Furnishing Status of the Houses/Apartments/Flats: Furnished, Semi-Furnished, or Unfurnished.
- Tenant Preferred: Type of Tenant Preferred by the Owner or Agent.
- Bathroom: Number of Bathrooms.
- Point of Contact: Whom to contact for more information regarding the Houses/Apartments/Flats.

In [49]:
df.head()

| | Posted On | BHK | Rent | Size | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-18 | 2 | 10000 | 1100 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner |
| 1 | 2022-05-13 | 2 | 20000 | 800 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner |
| 2 | 2022-05-16 | 2 | 17000 | 1000 | 1 out of 3 | Super Area | Salt Lake City Sector 2 | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner |
| 3 | 2022-07-04 | 2 | 10000 | 800 | 1 out of 2 | Super Area | Dumdum Park | Kolkata | Unfurnished | Bachelors/Family | 1 | Contact Owner |
| 4 | 2022-05-09 | 2 | 7500 | 850 | 1 out of 2 | Carpet Area | South Dum Dum | Kolkata | Unfurnished | Bachelors | 1 | Contact Owner |


In [50]:
remove_cols = [
    "Posted On",
    "Point of Contact",
    "Floor",
    "Area Locality",
]
df.drop(
    remove_cols,
    axis=1,
    inplace=True,
)

In [51]:
for col in [
    "Area Type",
    "City",
    "Furnishing Status",
    "Tenant Preferred",
]:
    print(col, "-", list(df[col].unique()))

Area Type - ['Super Area', 'Carpet Area', 'Built Area']
City - ['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad']
Furnishing Status - ['Unfurnished', 'Semi-Furnished', 'Furnished']
Tenant Preferred - ['Bachelors/Family', 'Bachelors', 'Family']


Our goal is to predict the rent from the given features, so the `Rent` column becomes the target column.

In [52]:
df.columns

Index(['BHK', 'Rent', 'Size', 'Area Type', 'City', 'Furnishing Status',
       'Tenant Preferred', 'Bathroom'],
      dtype='object')

NB: City could arguably be encoded better using the coordinates (latitude and longitude) of each city. Coordinates capture geographic information, such as how close two cities are, that one-hot encoding cannot.

scikit-learn can also handle the encoding, for example with `OneHotEncoder`. It is worth checking out!
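
As a minimal sketch of the scikit-learn route, the block below one-hot encodes a small stand-in `City` column with `OneHotEncoder`. The four-row frame and the unseen city "Pune" are made up for illustration; `handle_unknown="ignore"` is one possible setting, not necessarily the right one for every use.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small stand-in for the City column (values taken from the dataset's cities).
cities = pd.DataFrame({"City": ["Kolkata", "Mumbai", "Delhi", "Kolkata"]})

# handle_unknown="ignore" encodes categories not seen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(cities).toarray()  # sparse matrix -> dense array
categories = list(encoder.categories_[0])

# "Pune" is a hypothetical city that was not present when fitting.
unseen = encoder.transform(pd.DataFrame({"City": ["Pune"]})).toarray()

print(categories)
print(encoded)
print(unseen)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same column layout can be reproduced on new data.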

In [63]:
df_one_hot = pd.get_dummies(df[["City", "Tenant Preferred"]])

In [64]:
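# LabelEncoder assigns integer codes in alphabetical order of the labels:
# Furnished -> 0, Semi-Furnished -> 1, Unfurnished -> 2.
le_furnishing = LabelEncoder()
le_furnishing.fit(["Unfurnished", "Semi-Furnished", "Furnished"])
df["Furnishing Status"] = le_furnishing.transform(df["Furnishing Status"])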

In [65]:
le_area_type = LabelEncoder()
le_area_type.fit(["Carpet Area", "Built Area", "Super Area"])
df["Area Type"] = le_area_type.transform(df["Area Type"])
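
`LabelEncoder` sorts its classes alphabetically, so the integer codes do not follow the order in which the labels are passed to `fit`. The sketch below shows this and, as one possible alternative, uses `OrdinalEncoder` with an explicit category order. The ranking "Unfurnished < Semi-Furnished < Furnished" is an assumption chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

status = pd.DataFrame(
    {"Furnishing Status": ["Unfurnished", "Semi-Furnished", "Furnished"]}
)

# LabelEncoder: codes follow alphabetical order of the labels.
le = LabelEncoder().fit(status["Furnishing Status"])
alphabetical = dict(zip(le.classes_, le.transform(le.classes_)))

# OrdinalEncoder: codes follow the order given in `categories`
# (assumed ranking, from least to most furnished).
order = [["Unfurnished", "Semi-Furnished", "Furnished"]]
oe = OrdinalEncoder(categories=order)
ranked = oe.fit_transform(status[["Furnishing Status"]]).ravel()

print(alphabetical)
print(ranked)
```

For a column with no natural order, such as `Area Type`, integer codes impose a ranking that a linear model will treat as meaningful, so one-hot encoding may be the safer choice there.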

In [84]:
df.dtypes

BHK                          int64
Rent                         int64
Size                         int64
Area Type                    int64
Furnishing Status            int64
Bathroom                     int64
Area Type Encoded            int64
Furnishing Status Encoded    int64
dtype: object

In [66]:
df.head()

| | BHK | Rent | Size | Area Type | City | Furnishing Status | Tenant Preferred | Bathroom | Area Type Encoded | Furnishing Status Encoded |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 10000 | 1100 | 2 | Kolkata | 2 | Bachelors/Family | 2 | 2 | 2 |
| 1 | 2 | 20000 | 800 | 2 | Kolkata | 1 | Bachelors/Family | 1 | 2 | 1 |
| 2 | 2 | 17000 | 1000 | 2 | Kolkata | 1 | Bachelors/Family | 1 | 2 | 1 |
| 3 | 2 | 10000 | 800 | 2 | Kolkata | 2 | Bachelors/Family | 1 | 2 | 2 |
| 4 | 2 | 7500 | 850 | 1 | Kolkata | 2 | Bachelors | 1 | 1 | 2 |


In [67]:
remove_cols = ["City", "Tenant Preferred"]
df.drop(remove_cols, axis=1, inplace=True)

In [68]:
df_merged = pd.merge(df, df_one_hot, left_index=True, right_index=True)
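
Merging on the index with `left_index=True, right_index=True` lines the two frames up row by row. A minimal sketch on a two-row toy frame (values invented for illustration) shows that `DataFrame.join` gives the same result when both frames share the same index.

```python
import pandas as pd

# Toy frames (invented) that share the default 0..n-1 index.
left = pd.DataFrame({"BHK": [2, 3], "Size": [800, 1200]})
right = pd.get_dummies(pd.Series(["Kolkata", "Mumbai"], name="City"), prefix="City")

# Index-on-index merge, as used in the cell above.
merged = pd.merge(left, right, left_index=True, right_index=True)

# join() aligns on the index by default and is a shorter spelling here.
joined = left.join(right)

print(merged)
print(joined)
```

The alignment relies on the index being unchanged between the two frames; if rows had been filtered or reordered in only one of them, an index-based merge would drop or mismatch rows.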

In [69]:
df_merged.head()

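| | BHK | Rent | Size | Area Type | Furnishing Status | Bathroom | Area Type Encoded | Furnishing Status Encoded | City_Bangalore | City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | Tenant Preferred_Bachelors | Tenant Preferred_Bachelors/Family | Tenant Preferred_Family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 10000 | 1100 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 2 | 20000 | 800 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 2 | 17000 | 1000 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 2 | 10000 | 800 | 2 | 2 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 2 | 7500 | 850 | 1 | 2 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |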


In [70]:
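# Target: the rent. Features: every remaining column.
# Note: X is built from df, so the one-hot columns in df_merged (City,
# Tenant Preferred) are NOT used; build X from df_merged to include them.
y = df["Rent"]
X = df.drop(["Rent"], axis=1)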

In [71]:
from sklearn.model_selection import train_test_split

In [86]:
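# Note: test_size=0.85 holds out 85% of the rows for testing, so the model
# is trained on only 15%. A smaller test fraction (e.g. 0.2) is more common.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.85, random_state=42
)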
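
To make the effect of `test_size` concrete, here is a small sketch on 100 dummy rows (invented, not the rent data) comparing the split used above with a more conventional 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows (invented) so the split sizes are easy to read.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# test_size is the fraction of rows held out for evaluation.
X_tr_85, X_te_85, y_tr_85, y_te_85 = train_test_split(
    X_demo, y_demo, test_size=0.85, random_state=42
)
X_tr_20, X_te_20, y_tr_20, y_te_20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("test_size=0.85 -> train/test rows:", len(X_tr_85), len(X_te_85))
print("test_size=0.2  -> train/test rows:", len(X_tr_20), len(X_te_20))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.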

### Training model

In [87]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [88]:
y_pred = regressor.predict(X_test)

In [89]:
from sklearn.metrics import mean_squared_error

In [90]:
# Root mean squared error (RMSE) between the true and predicted rents,
# expressed in the same units as Rent.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [91]:
rmse

71243.48132693172
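
An RMSE value is easier to judge next to a reference point. The sketch below computes RMSE by hand, checks it against `mean_squared_error`, and compares it with a naive baseline that always predicts the mean. The true rents are the five values shown in `df.head()`; the predictions are invented for illustration and are not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# True rents from the first five rows; predictions are invented.
y_true = np.array([10000.0, 20000.0, 17000.0, 10000.0, 7500.0])
y_hat = np.array([12000.0, 18000.0, 17000.0, 9000.0, 8500.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# Naive baseline: always predict the mean of the true values.
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print("manual RMSE:  ", rmse_manual)
print("sklearn RMSE: ", rmse_sklearn)
print("baseline RMSE:", rmse_baseline)
```

A model is only useful if its RMSE is clearly below that of such a baseline computed on the same test set.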

In [92]:
y_pred

array([17925.61551189, 30130.34694683, 37198.25492689, ...,
       21049.12851805, 34710.23049915, 21148.42360843])

In [93]:
pd.DataFrame({"true_rent": y_test, "predicted_rent": y_pred})

| | true_rent | predicted_rent |
|---|---|---|
| 1566 | 16000 | 17925.615512 |
| 3159 | 12000 | 30130.346947 |
| 538 | 28000 | 37198.254927 |
| 2630 | 8000 | 68033.438076 |
| 4418 | 46000 | 118493.589923 |
| ... | ... | ... |
| 2324 | 5000 | 39323.169861 |
| 2709 | 4000 | -21951.068030 |
| 4179 | 10000 | 21049.128518 |
| 1312 | 85000 | 34710.230499 |

