**Sources**

https://www.kaggle.com/kamalkhumar/newyork-room-rental-ads-eda-and-prediction-nlp

https://charlescsr.github.io/mlnotes/python/spacy/spacy-remove-stopwords/

https://charlescsr.github.io/mlnotes/python/pandas/max-number-of-columns/

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
#pd.options.display.max_columns = 9999 #Maximum columns

import warnings
warnings.filterwarnings("ignore")

import missingno as msno
import matplotlib.pyplot as plt

import re
import spacy

from collections import Counter

import plotly.express as px

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score

from xgboost import XGBClassifier

# DataFrame Analysis

In [None]:
df = pd.read_csv('../input/newyork-room-rentalads/room-rental-ads.csv')
df.sample(5)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
df.isnull().sum().any()

In [None]:
df.isnull().sum()

In [None]:
msno.bar(df)
plt.show()

We have null values, but few enough that we can simply drop the affected rows.

In [None]:
df.dropna(how='any', inplace=True)
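
Before dropping, it can help to put a number on "small enough". A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset) that reports the share of nulls per column and how many rows `dropna(how='any')` would remove:

```python
import pandas as pd

# Toy frame with invented values, standing in for the rental-ads data
toy = pd.DataFrame({
    "Description": ["sunny room", None, "shared flat", "studio"],
    "Vague/Not": [1.0, 0.0, None, 0.0],
})

null_share = toy.isnull().mean()                 # fraction of nulls per column
rows_lost = int(toy.isnull().any(axis=1).sum())  # rows that dropna(how='any') would remove

print(null_share)
print(f"{rows_lost} of {len(toy)} rows contain at least one null")
```

If the share of rows lost were large, imputing or dropping only on selected columns might be a better fit than `how='any'`.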

In [None]:
df["Vague/Not"].value_counts()

The values are stored as floats even though they are only ever 1 or 0. Let us convert them to int and then to a category; that column is our target.

In [None]:
df.rename(columns = {"Vague/Not":"Target"},inplace = True)
df.Target = df.Target.astype("int").astype("category")
df
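
A small sketch of the same conversion on made-up values. It also shows why the nulls had to go first: casting a float column that still contains NaN to `int` raises an error.

```python
import pandas as pd

target = pd.Series([1.0, 0.0, 1.0, 0.0], name="Target")  # invented values

as_cat = target.astype("int").astype("category")
print(as_cat.dtype)                 # category
print(list(as_cat.cat.categories))  # [0, 1]

# With a missing value still present, the int cast fails
cast_failed = False
try:
    pd.Series([1.0, None]).astype("int")
except ValueError as err:
    cast_failed = True
    print("conversion failed:", err)
```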

In [None]:
#check for duplicates

len(df[df.duplicated()])

Huh, duplicates. I smell inflated accuracy with those around, so let's get rid of them.

In [None]:
df = df.drop_duplicates(subset=['Description'])
print(df.head())
print(df.shape)
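
Why duplicates matter: if the same description lands in both the training and the test split, the model is scored on text it has already seen, which can inflate accuracy. A small sketch on invented rows showing what `drop_duplicates(subset=['Description'])` keeps:

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Description": ["room near subway", "room near subway", "cozy studio", "room near subway"],
    "Target": [0, 0, 1, 0],
})

# duplicated() flags every repeat after the first occurrence
n_repeats = int(toy.duplicated(subset=["Description"]).sum())
deduped = toy.drop_duplicates(subset=["Description"])

print(n_repeats)      # number of rows that get dropped
print(deduped.shape)  # rows and columns left
```

Note that the original index labels are kept, so the result is no longer numbered 0..n-1 unless `reset_index(drop=True)` is called.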

Much better

# NLP Work

In [None]:
#Normalisation using spaCy

# The 'en' shortcut was removed in spaCy 3; load the small English model by its full name
nlp = spacy.load('en_core_web_sm')

def normalize(msg):
    msg = re.sub('[^A-Za-z]+', ' ', msg) #remove special characters and integers
    doc = nlp(msg)
    res = []
    for token in doc:
        # Skip stopwords, punctuation, currency symbols, whitespace and tokens of two characters or fewer
        if token.is_stop or token.is_punct or token.is_currency or token.is_space or len(token.text) <= 2:
            continue
        res.append(token.lemma_.lower())
    return res
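
To see what the cleaning steps do without loading a spaCy model, here is a rough standard-library approximation. It is only a sketch: `simple_normalize` and the tiny `STOPWORDS` set are hypothetical stand-ins, and there is no lemmatisation, so the output will differ from the spaCy version on real text.

```python
import re

# Tiny hypothetical stopword list; spaCy's own list is much larger
STOPWORDS = {"the", "and", "for", "with", "this", "that"}

def simple_normalize(msg):
    # Replace every run of non-letters (digits, punctuation, symbols) with one space
    msg = re.sub(r"[^A-Za-z]+", " ", msg)
    tokens = msg.lower().split()
    # Drop stopwords and tokens of two characters or fewer
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

print(simple_normalize("Rent: $1,200/month for the room, no fees!"))
# ['rent', 'month', 'room', 'fees']
```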

In [None]:
df["Description"] = df["Description"].apply(normalize)
df.head()

In [None]:
words_collection = Counter([item for sublist in df['Description'] for item in sublist])
freq_word_df = pd.DataFrame(words_collection.most_common(20))
freq_word_df.columns = ['frequently_used_word','count']

freq_word_df.style.background_gradient(cmap='Blues', low=0, high=0, axis=0, subset=None)

In [None]:
fig = px.bar(freq_word_df, x='frequently_used_word', y='count', color='count', title='Most frequent words')
fig.show()

Oh! Our descriptions are now lists of tokens. Let's join them back into strings, shall we?

In [None]:
df["Description"] = df["Description"].apply(lambda m : " ".join(m))

# Classification

In [None]:
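c = TfidfVectorizer(ngram_range=(1,2)) # Convert our strings to numerical values (unigrams and bigrams)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
mat = pd.DataFrame(c.fit_transform(df["Description"]).toarray(), columns=c.get_feature_names_out(), index=None)
mat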

In [None]:
X = mat
y = df["Target"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [None]:
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# K-fold Cross Validation

In [None]:
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
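
One caveat: the TF-IDF vocabulary and IDF weights were fit on the full dataset before splitting, so every validation fold has already influenced the features. A possible way around this, sketched on invented data, is to wrap the vectorizer and the classifier in a `Pipeline` so the vectorizer is refit on each training fold. `LogisticRegression` is used here only as a lightweight stand-in for `XGBClassifier`, and the texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Invented texts and labels, for illustration only
texts = [
    "large sunny room close subway", "room available", "call for details",
    "furnished bedroom shared kitchen utilities include", "nice place",
    "private bathroom laundry building near park", "good room",
    "quiet apartment female roommate prefer month lease",
]
labels = [0, 1, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # refit inside every fold
    ("clf", LogisticRegression()),                   # stand-in for XGBClassifier
])

# cv=2 only because the toy data is tiny; the notebook uses cv=10
scores = cross_val_score(pipe, texts, labels, cv=2)
print("Accuracy: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
```

The same pipeline object can be passed to `train_test_split`-based evaluation as well, with raw strings as `X`.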

**Closing Notes:**

* The model still has a lot of room for improvement

* Tuning the hyperparameters with GridSearchCV would be a natural next step
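
A rough sketch of what that tuning could look like. The data is synthetic, the grid values are illustrative rather than recommendations, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` (which also exposes `n_estimators` and `max_depth`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the TF-IDF matrix and target
X, y = make_classification(n_samples=60, n_features=8, random_state=0)

param_grid = {  # illustrative values only
    "n_estimators": [10, 20],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("Best CV accuracy: {:.2f} %".format(search.best_score_ * 100))
```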



<div style="float:right; font-size:30px">
    CSR
</div>