## <font color='green'> Sentiment Analysis of Text Messages </font>


This project explores how the in-app messaging feature, which connects applicants looking for a room with the residents of a household, shapes both parties' decisions: whether an applicant signs a lease, and whether the residents approve or refuse an applicant. 

Below I explain the steps I took in carrying out the project.

### <font color='blue'> Exploratory Analysis </font> 

The data set includes 180K observations with the following information: 

* **Channel ID**, assigned to each chat channel initiated for a listing. 

* **Creation time stamp** corresponding to each sent message.

* **User ID** of the sender (applicant or resident) of each message. 

* The role of each **user** (applicant or resident). 

* The **lease status** corresponding to the message channel initiated for each listing.

* **Market** as in the city where the listing is located. 

* And finally, the **messages**.

Through initial exploration of the data set the following observations are made: 

* Overall, the number of messages exchanged every month is increasing sharply, which suggests that more clients are using the application. However, the number of successful leases shows a smaller increase: despite higher engagement and usage, successful leases are not growing at the same pace.  
<img src="images/message_per_month.png"/>


* More applicants than residents engage and send messages. This is particularly interesting because there is usually more than one resident in each household. Not all residents of a household take part in the chat, even though they all should vote on accepting or refusing an applicant. In other words, this could indicate lower engagement of residents in the process. 
<img src="images/engagement.png"/>

These two main observations point us in the direction of exploring how the chat messages can improve the ...  

> The illustrations are produced in the [exploratory and explanatory notebook](). 

### <font color='blue'> Preprocessing and Feature Engineering </font> 


The data set is preprocessed to prepare it for sentiment analysis and to set up the features used for the machine learning model. In particular:

* The response time for each message is quantified, using the time stamps provided for each message. 

* The text of messages is cleaned up to remove http addresses, new lines, and other notations that are in the text because of ... 

* The length of conversations is quantified as the number of messages sent. 

* Messages are numbered based on their order in the conversation.

See [preprocessing notebook]() for details. 
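
The preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's code: the column names (`channel_id`, `created_at`, `text`) and the toy rows are hypothetical, and response time is assumed to mean the gap to the previous message in the same channel.

```python
import pandas as pd

# Hypothetical column names and toy rows; the real data set may differ.
messages = pd.DataFrame(
    {
        "channel_id": [1, 1, 1, 2, 2],
        "created_at": pd.to_datetime(
            [
                "2020-01-01 10:00:00",
                "2020-01-01 10:05:00",
                "2020-01-01 10:20:00",
                "2020-01-02 09:00:00",
                "2020-01-02 09:30:00",
            ]
        ),
        "text": [
            "Hi!\nIs the room still available?",
            "Yes, see https://example.com/listing for photos.",
            "Great, thanks.",
            "Hello",
            "Hi there",
        ],
    }
)

# Order messages chronologically within each channel.
messages = messages.sort_values(["channel_id", "created_at"]).reset_index(drop=True)

# Response time (assumed definition): seconds since the previous message
# in the same channel; the first message of a channel gets NaN.
messages["response_time_s"] = (
    messages.groupby("channel_id")["created_at"].diff().dt.total_seconds()
)

# Position of each message within its conversation, starting at 1.
messages["message_number"] = messages.groupby("channel_id").cumcount() + 1

# Conversation length: total number of messages in the channel.
messages["conversation_length"] = (
    messages.groupby("channel_id")["text"].transform("size")
)

# Text cleanup: drop URLs, then collapse newlines and repeated whitespace.
messages["clean_text"] = (
    messages["text"]
    .str.replace(r"https?://\S+", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
```

Sorting before grouping matters here, because both the time difference and the message number depend on the chronological order within each channel.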

### <font color='blue'> Sentiment Analysis </font> 

Using natural language processing techniques, the sentiment around each text message is evaluated. 

In particular, the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html) sentiment tool is used to extract the sentiment score of each sentence in a message. The Sentiment Annotator of CoreNLP implements [Socher et al.](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)'s sentiment model, attaching a binary tree to each sentence. Each node of the tree contains the predicted class and scores for its subtree. The current version of the annotator distinguishes five sentiment classes: very negative, negative, neutral, positive, and very positive. 


To record the sentiment scores of text messages, this notebook extracts the probability distribution over the 5 score classes (very negative to very positive) for each sentence within a message. Scores of -2, -1, 0, 1, and 2 are assigned to the 5 classes, respectively, and the expected score of each sentence is then calculated as: 

$$ E_i = \sum_{j=1}^{5} P_{i,j} \, s_j $$ 

where $E_i$ is the expected score of sentence *i*, $P_{i,j}$ is the probability that sentence *i* belongs to score class *j*, and $s_j \in \{-2, -1, 0, 1, 2\}$ is the score assigned to class *j*. 

Finally, the frequency of score classes for a text message is calculated by taking the histogram of the expected scores of all its sentences. Furthermore, the average of the expected scores of its sentences is calculated to indicate the overall score of the whole text message. 
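
As a rough sketch of this calculation, the NumPy function below takes the per-sentence class probabilities of one message and returns the expected scores, their histogram over the five classes, and their average. The function name and the bin edges are assumptions: each expected score is counted in the class whose score it is closest to, which may differ from the binning used in the notebook.

```python
import numpy as np

# Scores assigned to the five classes, very negative to very positive.
CLASS_SCORES = np.array([-2, -1, 0, 1, 2])


def message_sentiment(sentence_probs):
    """Summarize one message from per-sentence class probabilities.

    sentence_probs: array-like of shape (n_sentences, 5); each row is the
    probability distribution over the five classes for one sentence.
    """
    probs = np.asarray(sentence_probs, dtype=float)
    # Expected score per sentence: E_i = sum_j P_ij * s_j.
    expected = probs @ CLASS_SCORES
    # Assumed binning: count each expected score in the nearest class.
    edges = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
    counts, _ = np.histogram(expected, bins=edges)
    return expected, counts, expected.mean()


# Toy message with three sentences (made-up probabilities).
probs = [
    [0.1, 0.2, 0.4, 0.2, 0.1],  # symmetric distribution, expected score 0
    [0.0, 0.0, 0.1, 0.8, 0.1],  # mostly positive
    [0.7, 0.2, 0.1, 0.0, 0.0],  # mostly very negative
]
expected, counts, average = message_sentiment(probs)
```

For this toy input the three sentences fall into the neutral, positive, and very negative bins, and the message average is slightly negative.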

During the sentiment analysis, the numbers of sentences and words in each message are also extracted and added to the data frame.

> See the [sentiment analysis notebook]() for further information. 

### <font color='orange'> Implications </font> 

The sentiment scores of all messages follow a multimodal distribution (figure below), with a dominant peak of neutral messages, as expected: most messages contain neutral content such as introducing oneself or exchanging contacts to meet in person. Two smaller peaks are observed at negative and positive scores. These messages could carry more signal. I took a closer look at the messages that contain a higher ratio of negatively scored sentences. Interestingly, most of these messages had more conflicting content, and a pipeline to identify such messages could be an excellent tool for the company to improve the clients' experience. 

Filtering the thousands of messages sent via the application each day and flagging a number of high-signal messages for the customer service unit would allow the company to identify clients' concerns and address them more effectively. 
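
A flagging step of this kind could look roughly like the sketch below. The function names, the message ids, and the 0.5 threshold are all hypothetical; a real threshold would have to be tuned against messages reviewed by the customer service unit.

```python
def negative_ratio(class_counts):
    """Share of sentences scored negative or very negative.

    class_counts: sentence counts per class, in the order
    (very negative, negative, neutral, positive, very positive).
    """
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return (class_counts[0] + class_counts[1]) / total


def flag_messages(counts_by_message, threshold=0.5):
    """Return ids of messages whose negative ratio meets the threshold."""
    return [
        message_id
        for message_id, counts in counts_by_message.items()
        if negative_ratio(counts) >= threshold
    ]


# Toy input: message id -> per-class sentence counts (made-up values).
counts_by_message = {
    "a": [0, 0, 3, 1, 0],  # mostly neutral
    "b": [2, 1, 1, 0, 0],  # three of four sentences are negative
    "c": [0, 0, 0, 0, 0],  # empty message
}
flagged = flag_messages(counts_by_message)
```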

<img src="images/sentiment_distribution.png"/>




### <font color='blue'> Supervised Machine Learning Model </font> 


A [scikit-learn Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is trained to predict the lease status corresponding to each channel, following the steps below: 

**Features, observations, and labels:**
  - User role (applicant or resident),
  - Response time,
  - Conversation length,
  - Message length,
  - Average sentiment score for a text,
  - Frequency of sentiment score of all sentences within each message in 5 classes: very negative, negative, neutral, positive, very positive. 
    
The data set includes a total of 180K observations and 10 features. A binary classification is performed to predict the lease status. 35.31% of conversations correspond to a successful lease (same room or another), versus 64.69% that end without leasing a room at all (see figure below). The classes are therefore treated as approximately balanced. 

<img src="images/ml_classes.png"/>

The data set is manually divided into training and testing sets (90%:10%) to ensure that messages belonging to the same channel are kept together in one set. 
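
One way to do such a manual split is to sample channels rather than messages, as in the sketch below. The function name is hypothetical, and this version assigns about 10% of the channels to the test set, so the share of messages is only approximately 10%. scikit-learn's `GroupShuffleSplit` offers a similar group-aware split.

```python
import numpy as np


def split_by_channel(channel_ids, test_fraction=0.1, seed=0):
    """Return boolean masks (train, test) that keep each channel in one set."""
    channel_ids = np.asarray(channel_ids)
    unique_channels = np.unique(channel_ids)
    # Shuffle the channels, not the individual messages.
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_channels)
    n_test = max(1, int(round(test_fraction * len(unique_channels))))
    test_channels = unique_channels[:n_test]
    # A message goes to the test set if its channel was selected.
    test_mask = np.isin(channel_ids, test_channels)
    return ~test_mask, test_mask


# Toy data: 20 channels with 3 messages each.
channel_ids = np.repeat(np.arange(20), 3)
train_mask, test_mask = split_by_channel(channel_ids)
```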


**Preprocessing**: The categorical feature, i.e., the user role (applicant or resident), is encoded via the [One Hot Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). 
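
A minimal example of this encoding, using toy role values rather than the project's data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with the user role of three toy messages.
roles = np.array([["applicant"], ["resident"], ["applicant"]])

# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; densify for inspection.
encoded = encoder.fit_transform(roles).toarray()
```

The learned categories are stored in `encoder.categories_`, in sorted order, so the first output column corresponds to `applicant` and the second to `resident`.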

**Training a classifier**: The hyperparameters are first narrowed down using [Randomized Search Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) and further optimized via [Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), both provided in the Python scikit-learn library. 
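
The two-stage search can be sketched as follows. The synthetic data from `make_classification` stands in for the real features, and the parameter ranges, `n_iter`, and `cv` values are illustrative assumptions, not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the 10-feature data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stage 1: coarse randomized search over a wide range of values.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [20, 50, 100],
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
coarse.fit(X, y)

# Stage 2: exhaustive grid search in a narrow band around the best values.
best = coarse.best_params_
n_est = best["n_estimators"]
leaf = best["min_samples_leaf"]

fine = GridSearchCV(
    RandomForestClassifier(random_state=0, max_depth=best["max_depth"]),
    param_grid={
        "n_estimators": [max(10, n_est - 10), n_est, n_est + 10],
        "min_samples_leaf": sorted({max(1, leaf - 1), leaf, leaf + 1}),
    },
    cv=3,
)
fine.fit(X, y)
```

After fitting, `fine.best_params_` and `fine.best_score_` hold the selected hyperparameters and their mean cross-validated accuracy, and `fine.best_estimator_` is the refitted model.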

> See [ML model notebook]() for details. 

### <font color='orange'> Implications </font> 

The performance of the ML model is limited by the features used in training. Many components other than the chat impression can affect (1) the residents' vote on accepting or rejecting an applicant, and (2) the applicant's decision on signing a lease for a room. 

In future work, other features such as (1) the number of maintenance tickets submitted by each household, (2) the unit price compared to market prices, and (3) the number of available rooms versus total rooms could be included to improve the accuracy of the predictions. 