# Text Message Analysis
## Overview
This project analyzes a dataset of 67,093 text messages (SMS) taken from the corpus on Mar 9, 2015. The messages come mostly from Singaporeans and students attending the National University of Singapore. The objective is to perform a **Sentiment Analysis** and a **Frequency Insight** of the dataset. With this project we'll gather insights about how text messages are used, and answer questions such as:
- Are short messages more negative than longer messages?
- Are positive messages more dominant in the dataset?
- Do positive messages contain more emojis or informal words?

## Data Loading & Cleaning
The first step is to load the data and get it ready for examination: we'll import the necessary libraries, load the data, and clean the dataframe.

In [1]:
# Importing relevant libraries
import pandas as pd
import numpy as np

# Loading the data
sms = pd.read_csv('clean_nus_sms.csv')

# Confirming by printing first few rows
sms.head(3)

Unnamed: 0.1,Unnamed: 0,id,Message,length,country,Date
0,0,10120,Bugis oso near wat...,21,SG,2003/4
1,1,10121,"Go until jurong point, crazy.. Available only ...",111,SG,2003/4
2,2,10122,I dunno until when... Lets go learn pilates...,46,SG,2003/4


In [2]:
# Checking column data types and looking for NULL values
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48598 entries, 0 to 48597
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  48598 non-null  int64 
 1   id          48598 non-null  int64 
 2   Message     48595 non-null  object
 3   length      48598 non-null  object
 4   country     48598 non-null  object
 5   Date        48598 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.2+ MB


The first column, `Unnamed: 0`, is just a copy of the index, so it can be dropped as redundant. The data types are appropriate with the exception of `length`, which is stored as `object` but should be numeric. This was likely caused by at least one value not being a valid integer, so we'll convert the column to numeric and force non-numeric values to `NaN`. There are also three text messages with `NULL` values. Since messages without text are not meaningful for our project, they will be dropped, along with any rows whose `length` could not be converted.

In [3]:
# Dropping the first column, the repeated index
sms = sms.drop(columns = "Unnamed: 0")

# Changing the datatype for the column 'length'
sms['length'] = pd.to_numeric(sms['length'], errors = 'coerce')

# Dropping rows where there are NULL values
sms = sms.dropna(how = 'any')

# Confirming the changes by printing the first few rows
sms.head(3)

Unnamed: 0,id,Message,length,country,Date
0,10120,Bugis oso near wat...,21.0,SG,2003/4
1,10121,"Go until jurong point, crazy.. Available only ...",111.0,SG,2003/4
2,10122,I dunno until when... Lets go learn pilates...,46.0,SG,2003/4
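
The coercion step can be illustrated on a toy column. This is a minimal sketch, assuming made-up values (the real offending entries in `length` are not shown in the notebook), of how `errors='coerce'` behaves and how the rejected entries could be inspected before they are dropped:

```python
import pandas as pd

# Toy column standing in for 'length'; the values are illustrative only
raw = pd.Series(["21", "111", "n/a", "46"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
lengths = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw column but became NaN after coercion
bad_values = raw[lengths.isna() & raw.notna()]
print(bad_values.tolist())  # ['n/a']
```

Because `NaN` is a floating-point value, the coerced column becomes `float64`, which is why the lengths above print as `21.0` rather than `21`. Once the `NaN` rows are dropped, the column could be cast back with `.astype('int64')` if integer lengths are preferred.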


Now we can compute summary statistics for `length` and see which countries the dataset covers.

In [4]:
# Printing the summary statistics for 'length'
display(sms['length'].describe())

# Displaying the different countries in the dataset
sms['country'].value_counts()

count    48591.000000
mean        54.853594
std         53.203152
min          1.000000
25%         21.000000
50%         39.000000
75%         70.000000
max        910.000000
Name: length, dtype: float64

country
Singapore              22011
SG                      9803
India                   6901
United States           3748
USA                     1931
Sri Lanka               1017
Malaysia                 766
Pakistan                 751
unknown                  602
Canada                   198
Bangladesh               126
China                    107
india                    105
INDIA                     79
Philippines               67
Indonesia                 48
Nepal                     39
srilanka                  30
United Kingdom            30
Hungary                   28
Serbia                    22
Kenya                     20
Ghana                     18
UK                        10
Italia                    10
Trinidad and Tobago       10
Turkey                    10
Nigeria                   10
Macedonia                 10
New Zealand               10
Slovenia                  10
Lebanon                   10
Romania                    9
Morocco                    9
Austra

The country list shows that more cleaning is needed: countries that appear under several spellings, abbreviations, or capitalizations must be merged, and all names standardized to title case.

In [8]:
# Merging repeated countries names
country_mapping = {
    'SG': 'Singapore',
    'Singapore': 'Singapore',
    'USA': 'United States',
    'United States': 'United States',
    'india': 'India',
    'INDIA': 'India',
    'India': 'India',
    'UK': 'United Kingdom',
    'United Kingdom': 'United Kingdom',
    'srilanka': 'Sri Lanka',
    'Sri Lanka': 'Sri Lanka',
    'Italia': 'Italy',
    'MY': 'Malaysia'  # likely Malaysia
}

# Apply the mapping, leave all other countries as-is
sms['country'] = sms['country'].map(lambda x: country_mapping.get(x, x))

# Converting all to Title case
sms['country'] = sms['country'].str.title()

# Checking results
sms['country'].value_counts()

country
Singapore              31814
India                   7085
United States           5679
Sri Lanka               1047
Malaysia                 767
Pakistan                 751
Unknown                  602
Canada                   198
Bangladesh               126
China                    107
Philippines               67
Indonesia                 48
United Kingdom            40
Nepal                     39
Hungary                   28
Serbia                    22
Kenya                     20
Ghana                     18
Lebanon                   10
Trinidad And Tobago       10
Macedonia                 10
Turkey                    10
Nigeria                   10
Slovenia                  10
New Zealand               10
Italy                     10
Morocco                    9
Romania                    9
Australia                  9
Jamaica                    8
Barbados                   8
France                     5
Spain                      5
Name: count, dtype: int64
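
The order of the two steps matters: abbreviations such as `SG` or `USA` have to be mapped before title-casing, otherwise they would become `Sg` and `Usa`. Case-only variants such as `india` and `INDIA`, on the other hand, are collapsed by `str.title()` on its own. A minimal sketch on toy data (the `abbreviations` dictionary and the sample values are illustrative, not the notebook's full mapping):

```python
import pandas as pd

# Toy data; the spellings mirror a few variants seen in the country list
countries = pd.Series(["SG", "Singapore", "india", "INDIA", "USA", "unknown"])

# Only abbreviations need an explicit entry here, and they must be
# mapped first: "SG".title() gives "Sg" and "USA".title() gives "Usa"
abbreviations = {"SG": "Singapore", "USA": "United States"}

# Values without an entry in the dictionary are left unchanged by replace()
cleaned = countries.replace(abbreviations).str.title()
print(cleaned.value_counts())
```

One side effect of `str.title()` is that it capitalizes every word, which is why `Trinidad and Tobago` appears as `Trinidad And Tobago` in the output above.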

## Preprocessing
Once the data is clean, we can start preprocessing it. The `Message` column will be lowercased and have its punctuation and stopwords removed, and then we'll apply lemmatization.
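
To make the first three steps concrete, here is a minimal sketch using only the Python standard library. The `STOPWORDS` set and the `preprocess` helper are hypothetical and deliberately tiny. Lemmatization is left out because it needs a lexical resource from an NLP library.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"i", "until", "when", "go", "the", "a"}

def preprocess(message):
    """Lowercase a message, strip punctuation, and drop stopwords."""
    lowered = message.lower()
    # Remove every ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only tokens that are not stopwords
    return [token for token in no_punct.split() if token not in STOPWORDS]

print(preprocess("I dunno until when... Lets go learn pilates..."))
# ['dunno', 'lets', 'learn', 'pilates']
```

Informal spellings such as `dunno` survive this step unchanged, which is relevant for the question of whether positive messages contain more informal words.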