# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: EA Twitter Sentiment classification.

This process requires the user to input text (ideally a tweet relating to climate change), and will classify it according to whether or not its author believes in climate change. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of classifying the sentiment of a tweet about climate change;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner workings of the model to a non-technical audience.

Formally, the problem statement is as follows.

This process requires the user to input text (ideally a tweet relating to climate change), and will classify it according to whether or not its author believes in climate change. Below you will find information about the data source and a brief data description. You can have a look at word clouds and other general EDA on the EDA page, and make your predictions on the prediction page that you can navigate to in the sidebar.
 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [None]:
# Standard library
import os
import re
import string

# Third-party libraries
import joblib
import matplotlib.pyplot as plt
import pandas as pd
import streamlit as st
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, PorterStemmer, LancasterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier




<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [None]:
# Vectorizer
#news_vectorizer = open("tfidfvect.pkl","rb")
#tweet_cv = joblib.load(news_vectorizer) # loading your vectorizer from the pkl file

In [None]:
raw = pd.read_csv("train.csv", encoding='utf-8')

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---
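
A quick check that fits this step is the share of each sentiment class, since class imbalance affects how the later metrics should be read. The sketch below uses a small made-up DataFrame (`toy`) and a hypothetical helper `class_shares`; neither is part of the original notebook, and the real data would come from `train.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the training data (illustration only)
toy = pd.DataFrame({
    "sentiment": [1, 1, 1, 2, 0, -1, 1, 2],
    "message": ["placeholder tweet"] * 8,
})

def class_shares(df, column="sentiment"):
    """Return the fraction of rows in each class, largest first."""
    return df[column].value_counts(normalize=True)

shares = class_shares(toy)
print(shares)
```

On the real data the same call would be `class_shares(raw)`, assuming the column is named `sentiment` as in the cells below.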


In [None]:
raw.head(20)

In [None]:
raw.info()

In [None]:
raw.shape

In [None]:
raw.isnull().sum()

In [None]:
sentiment_labels = {
    '-1': '-1:Non-believer',
    '0': '0:Not interested',
    '1': '1:Neutral',
    '2': '2:Out of topic'
}

# Count tweets per sentiment class and plot the counts as a bar chart
sentiment_counts = raw['sentiment'].value_counts()
ax = sentiment_counts.plot(kind='bar')

# Label the bars in the same order in which value_counts() draws them
ax.set_xticklabels([sentiment_labels.get(str(sentiment), 'Unknown') for sentiment in sentiment_counts.index])
ax.set_ylabel('Count')

plt.show()




In [None]:
hashtag_list = []

# Loop over every cell in the "message" column
for message in raw["message"]:
    if isinstance(message, str):
        # Keep only the tokens that actually start with "#"
        for tag in re.findall(r'#\w+', message):
            hashtag_list.append(tag.lower())

print(hashtag_list[:20])


In [None]:
from collections import Counter

hashtag_counts = Counter(hashtag_list)

print("Total unique hashtags:", len(hashtag_counts))

print("Most common hashtags:")
for tag, count in hashtag_counts.most_common(7):
    print(tag, "-", count)

In [None]:
hashtag_counts = Counter(hashtag_list)
top_hashtags = hashtag_counts.most_common(7)
hashtags, counts = zip(*top_hashtags)

plt.figure(figsize=(10, 6))
plt.bar(hashtags, counts, color='red')
plt.xlabel('Hashtags')
plt.ylabel('Count')
plt.title('Top 7 Most Common Hashtags')
plt.xticks(rotation=45)

plt.show()



In [None]:
raw.head()

In [None]:
print(raw["message"][90])

## Text Cleaning

In [None]:
def remove_handles(post):
    # Replace every @handle with a single space
    return re.sub(r'@\S+', ' ', post)

In [None]:
raw['message'] = raw['message'].apply(remove_handles)
raw.head(10)

In [None]:
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
#subs_url = r'url-web'
raw['message'] = raw['message'].replace(to_replace = pattern_url,value = " ", regex = True)
print(raw["message"][2])

In [None]:
def remove_hashtags(post):
    # Replace every #hashtag with a single space
    return re.sub(r'#\S+', ' ', post)

In [None]:
raw['message'] = raw['message'].apply(remove_hashtags)
print(raw["message"][2])

In [None]:
def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])



In [None]:
raw["message"] = raw["message"].apply(remove_punctuation)
print(raw["message"][2])

In [None]:
raw.head()

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---
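
The text-cleaning steps applied earlier (handles, URLs, hashtags, punctuation) can also be collected into one helper, which makes it easier to apply exactly the same cleaning to new tweets at prediction time. This is a minimal sketch under assumptions: `clean_tweet` is a hypothetical name, the URL pattern is a simplified one, and the order of the steps differs slightly from the cells above.

```python
import re
import string

def clean_tweet(text):
    """Strip URLs, @handles, #hashtags and punctuation from one tweet."""
    text = re.sub(r'https?://\S+', ' ', text)   # URLs (simplified pattern)
    text = re.sub(r'@\S+', ' ', text)           # @handles
    text = re.sub(r'#\S+', ' ', text)           # #hashtags
    # Delete every ASCII punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())               # collapse repeated whitespace

example = "RT @user: Climate change is real! #climate https://t.co/abc"
print(clean_tweet(example))
```

Applied to the notebook's data, this would be `raw['message'].apply(clean_tweet)` on the unprocessed messages.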

In [None]:
# Dealing with class imbalances

## Removing noise

In [None]:
# Drop the tweetid column, which is not used as a feature
raw = raw.drop(["tweetid"], axis=1)

In [None]:
# Tokenising: work on a copy so that `raw` keeps the untokenised text
raw2 = raw.copy()
raw2['message'] = raw2['message'].str.split()

In [None]:
raw2.head()

In [None]:
# Stemming

stemmer = SnowballStemmer("english")
raw2['message'] = raw2['message'].apply(lambda x: [stemmer.stem(y) for y in x])

In [None]:
raw2['message'][2]

In [None]:
#removing stopwords
stopwords_list = stopwords.words('english')
print(stopwords_list)

In [None]:
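# Build the stop-word set once; a set gives fast membership tests
stop_words_set = set(stopwords.words('english'))

def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words_set]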

In [None]:
raw2['message'] = raw2['message'].apply(remove_stop_words)

In [None]:
raw2['message'][2]

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def mbti_lemma(words, lemmatizer):
    return [lemmatizer.lemmatize(word) for word in words] 

In [None]:
raw2['message'] = raw2['message'].apply(mbti_lemma, args=(lemmatizer, ))

In [None]:
raw2.head()

In [None]:
raw2['message'][2]

In [None]:
# Downsample every class to the size of the smallest class
sentiment_counts = raw['sentiment'].value_counts()

minority_class = sentiment_counts.idxmin()
minority_count = sentiment_counts.loc[minority_class]
downsampled_raw = pd.concat([raw[raw['sentiment'] == minority_class]] +
                            [raw[raw['sentiment'] == sentiment].sample(minority_count, replace=False, random_state=42)
                             for sentiment in sentiment_counts.index if sentiment != minority_class])

# Plot the balanced counts, labelling the bars in the order they are drawn
downsampled_counts = downsampled_raw['sentiment'].value_counts()
ax = downsampled_counts.plot(kind='bar')
ax.set_xticklabels([sentiment_labels.get(str(sentiment), 'Unknown') for sentiment in downsampled_counts.index])
ax.set_ylabel('Count')
plt.show()


In [None]:
y = raw2['sentiment']

# features
X = raw2['message']

In [None]:
X = X.apply(' '.join)

In [None]:
X.head()

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that are able to accurately predict the sentiment of a tweet. |

---
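
One way to keep the vectoriser and the classifier in step is to chain them in a scikit-learn pipeline, so that raw text goes in and a predicted class comes out. The sketch below is an illustration under assumptions, not the notebook's own approach: the tweets and labels are made up, and default `CountVectorizer` settings are used instead of the tuned ones in the cells that follow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets and labels (illustration only)
texts = [
    "climate change is a hoax",
    "global warming is a hoax and a scam",
    "climate change is real and urgent",
    "we must act on global warming now",
]
labels = [-1, -1, 1, 1]

# Chain the vectoriser and the classifier so both are fitted together
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# The fitted pipeline accepts raw strings directly
preds = model.predict(["global warming is a scam"])
print(preds)
```

A pipeline like this can also be saved as a single object, which avoids having to store the vectoriser separately from the model.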

In [None]:
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Converting words into numbers.
from sklearn.feature_extraction.text import CountVectorizer

betterVect = CountVectorizer(stop_words='english', 
                             min_df=2, 
                             max_df=0.5,
                             ngram_range=(1, 1))

In [None]:

X_train_fitted = betterVect.fit_transform(X_train)
X_test_counts = betterVect.transform(X_test)


## Logistic Regression Model

In [None]:
# Train the logistic regression model on the sparse count matrix.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_fitted, y_train)

In [None]:
predictions = lr.predict(X_test_counts)
print(predictions)

In [None]:
print(y_test)

In [None]:
#Intercept
lr.intercept_[0]

In [None]:
#Coefficients
#coeff_df = pd.DataFrame(lr.coef_.T,X.columns,columns=['Coefficient'])
#coeff_df.head()

In [None]:
#Assessing Model Performance using the Confusion Matrix
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(y_test, predictions)

In [None]:
# Confusion matrix, labelled in the sorted class order that sklearn uses (-1, 0, 1, 2)
labels = ['-1: non-believer', '0: not interested', '1: neutral', '2: out of topic']

pd.DataFrame(data=confusion_matrix(y_test, predictions), index=labels, columns=labels)

In [None]:
# Classification report in sklearn; target_names follow the sorted class order (-1, 0, 1, 2)

print('Classification Report')
print(classification_report(y_test, predictions, target_names=['-1: non-believer', '0: not interested', '1: neutral', '2: out of topic']))

In [None]:
from sklearn import metrics
print(metrics.classification_report(y_test, predictions))


## Decision Tree Classification Model

In [None]:
# Standardise the data
# from sklearn.preprocessing import StandardScaler
# standard_scaler = StandardScaler()
# X_test_counts = standard_scaler.fit_transform(X)

In [None]:
# tree = DecisionTreeClassifier(random_state=42)

In [None]:
# tree.fit(X_train, y_train)
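
The commented-out `tree.fit(X_train, y_train)` would not run as written, because `X_train` holds raw strings and a decision tree needs numeric features. A minimal sketch of the idea, assuming made-up tweets and a default `CountVectorizer`, fits the tree on the vectorised counts instead:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy tweets and labels (illustration only)
texts = [
    "climate change is a hoax",
    "global warming is a scam",
    "climate change is real and urgent",
    "we must act on global warming now",
]
labels = [-1, -1, 1, 1]

# Turn the text into word counts before fitting the tree
vect = CountVectorizer()
X_counts = vect.fit_transform(texts)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_counts, labels)

# An unpruned tree reproduces its training labels when the rows are distinct
train_preds = tree.predict(X_counts)
print(train_preds)
```

In the notebook the equivalent call would be `tree.fit(X_train_fitted, y_train)`, followed by `tree.predict(X_test_counts)`. Perfect training accuracy here is a sign of how easily an unpruned tree overfits, so the holdout score is the one to compare.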

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---
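
One way to compare models on the holdout set is a single summary score per model, for example the macro-averaged F1, which weights every sentiment class equally regardless of its size. The labels and predictions below are invented purely to show the mechanics; `model_a` and `model_b` are placeholders, not results from this notebook.

```python
from sklearn.metrics import f1_score

# Hypothetical holdout labels and predictions (illustration only)
y_true = [-1, 0, 1, 1, 2, 2, 1, 0]
preds_by_model = {
    "model_a": [-1, 0, 1, 1, 2, 1, 1, 0],
    "model_b": [-1, 0, 2, 1, 2, 1, 0, 0],
}

# Macro F1: per-class F1 scores averaged without class weighting
scores = {
    name: f1_score(y_true, preds, average="macro")
    for name, preds in preds_by_model.items()
}

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
print("best:", best)
```

With the real models, the dictionary would hold each model's predictions on `X_test_counts` and `y_true` would be `y_test`.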

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

In [None]:
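# Save the DataFrame under its own (hypothetical) file name so that the
# model dump below does not overwrite it
raw.to_pickle('train_clean.pkl')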

In [None]:
import joblib

# Assuming 'lr' is your trained Logistic Regression model
joblib.dump(lr, 'train3.pkl')
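
A saved classifier can only score new tweets if the fitted vectoriser is available too, since the model expects the same columns it was trained on. The sketch below stores both objects in one file and reloads them; the dictionary layout, the file name `model_bundle.pkl` and the toy data are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy tweets and labels (illustration only)
texts = [
    "climate change is a hoax",
    "global warming is a scam",
    "climate change is real and urgent",
    "we must act on global warming now",
]
labels = [-1, -1, 1, 1]

vect = CountVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vect.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_bundle.pkl")  # hypothetical file name
    # Store the vectoriser and the model together in one file
    joblib.dump({"vectorizer": vect, "model": clf}, path)
    bundle = joblib.load(path)

# Transform new text with the saved vectoriser, then predict
new_counts = bundle["vectorizer"].transform(["climate change is real"])
new_pred = bundle["model"].predict(new_counts)
print(new_pred)
```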

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic