# Introduction
This project performs sentiment analysis on employee messages from a test dataset. The goal is to understand the distribution of sentiments (positive, negative, neutral) and identify patterns in employee opinions.



# Importing Libraries

In [None]:
# Install dependencies first so the import below succeeds on a fresh runtime
!pip install pandas openpyxl
!pip install transformers torch

import pandas as pd


In [None]:
from google.colab import drive

drive.mount('/content/drive')

# Loading the Dataset

In [None]:
path = "/content/drive/MyDrive/test.xlsx"
df = pd.read_excel(path)
df

The dataset contains multiple columns. The cells below use `Subject` and `body` (the message text), `date`, and `from` (the sender); the `sentiment` column is added later by the classifier.
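
Before running the later cells, it can help to confirm that the expected columns are present. The sketch below is a minimal check, assuming the column names used further down in this notebook (`Subject`, `body`, `date`, `from`). The helper `missing_columns` and the stand-in frame `sample` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Columns referenced by later cells in this notebook
REQUIRED_COLUMNS = ["Subject", "body", "date", "from"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]

# Tiny stand-in frame; the real `df` comes from pd.read_excel above
sample = pd.DataFrame({"Subject": ["Hi"], "body": ["Hello"], "date": ["2010-01-05"]})
print(missing_columns(sample))
```

With the real data, `missing_columns(df)` returning an empty list means the remaining cells can find every column they reference.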

# Sentiment Analysis and Classification

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import torch

In [None]:
# Handle missing values
df['Subject'] = df['Subject'].fillna('').astype(str)
df['body'] = df['body'].fillna('').astype(str)

# Create a message column for analysis
df['message'] = df['Subject'] + " " + df['body']

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

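def map_sentiment_label(text):
    # The character slice keeps inputs short; truncation=True additionally caps the
    # input at 512 tokens, which a character count alone cannot guarantee
    result = classifier(text[:512], truncation=True, max_length=512)[0]
    # The model returns generic labels; map them to readable sentiment names
    label_map = {
        "LABEL_0": "Negative",
        "LABEL_1": "Neutral",
        "LABEL_2": "Positive"
    }
    return label_map[result['label']]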


In [None]:
df['sentiment'] = df['message'].apply(map_sentiment_label)


In [None]:
df.head()

A new column, `sentiment`, is added. It classifies each message as Positive, Neutral, or Negative.
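
The label mapping can be tried without loading the model. The sketch below uses hand-written dictionaries in the shape a text-classification pipeline returns (`{'label': ..., 'score': ...}`). The helper `label_from_result` is hypothetical, and the fallback for unknown labels is an assumption about how one might want to handle a model that uses different label names.

```python
LABEL_MAP = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

def label_from_result(result, label_map=LABEL_MAP):
    """Translate one pipeline result dict into a readable sentiment name.

    Unknown labels are returned unchanged instead of raising KeyError.
    """
    raw = result["label"]
    return label_map.get(raw, raw)

# Hand-written results for illustration; no model is called here
examples = [
    {"label": "LABEL_2", "score": 0.91},
    {"label": "LABEL_0", "score": 0.77},
    {"label": "neutral", "score": 0.60},  # e.g. a model that already uses names
]
print([label_from_result(r) for r in examples])
```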

# Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.isnull().sum()

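The output above shows no missing values in the core columns (missing `Subject` and `body` entries were already replaced with empty strings). The `sentiment` column is categorical, while `message` contains the combined subject and body text.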



In [None]:
# Count sentiment labels
sentiment_counts = df['sentiment'].value_counts()
print(sentiment_counts)

# Bar chart
plt.figure(figsize=(6,4))
sns.countplot(x='sentiment', data=df, hue='sentiment',order=sentiment_counts.index, palette="Set2", legend=False)
plt.title("Distribution of Sentiment Labels")
plt.xlabel("Sentiment")
plt.ylabel("Number of Messages")
plt.show()


Neutral messages are the most common and negative messages the least common. The number of positive messages falls between the two.
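
Raw counts are easier to compare across datasets when shown alongside shares. The sketch below uses made-up labels, not the notebook's data, and relies on `value_counts(normalize=True)`, which returns the fraction of rows per label.

```python
import pandas as pd

# Made-up labels for illustration only
labels = pd.Series(["Neutral", "Neutral", "Positive", "Negative", "Neutral", "Positive"])

counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(3)

# Both Series are indexed by label, so they align when combined
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

On the real data the same two calls on `df['sentiment']` would give the share of each sentiment class.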

In [None]:
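# Reset the index only if 'date' was previously set as the index
if df.index.name == 'date':
    df = df.reset_index()

# Make sure 'date' is a datetime column before using the .dt accessor
df['date'] = pd.to_datetime(df['date'])

# Recreate daily sentiment counts
df['date_only'] = df['date'].dt.date
daily_sentiment = df.groupby(['date_only', 'sentiment']).size().unstack(fill_value=0)

# Apply rolling average over 30 rows, i.e. 30 days that have at least one message
rolling_sentiment = daily_sentiment.rolling(window=30).mean()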

# Plot
rolling_sentiment.plot(figsize=(12, 6), linewidth=2)
plt.title("Sentiment Trend (30-Day Rolling Average)")
plt.xlabel("Date")
plt.ylabel("Number of Messages")
plt.grid(True)
plt.tight_layout()
plt.show()


# **Observations :**
## 1. Neutral Sentiment Dominates
The orange line (Neutral) consistently stays above both Positive and Negative sentiments across the entire period.

This suggests that most employee messages are neutral: likely informational, procedural, or emotionally neutral in tone.

## 2. Positive Sentiment is Stable, Slightly Increasing
The green line (Positive) shows a gradual upward trend, especially during early 2011 and again toward late 2011.

This could imply improving morale or satisfaction over time.

## 3. Negative Sentiment is Low and Stable
The blue line (Negative) remains consistently low, with only minor fluctuations.

This indicates that there aren't strong or frequent complaints, suggesting either:

- Good employee engagement, or
- Employees may be reluctant to express dissatisfaction.
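
The trend plot uses `rolling(window=30)`, which averages the last 30 rows of the daily table. If some calendar days have no messages, those days are absent from the table, so 30 rows can span more than 30 days. The sketch below uses made-up counts and a 2-row window to show the difference. Filling missing days with zero via `asfreq` is one possible choice, not necessarily what this analysis intended.

```python
import pandas as pd

# Made-up daily counts with a gap: no messages on 2010-01-03 and 2010-01-04
daily = pd.Series(
    [4, 2, 6],
    index=pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-05"]),
)

# Row-based window: averages the last 2 rows, whatever dates they carry
row_based = daily.rolling(window=2).mean()

# Calendar-based: insert the missing days as 0 first, then average 2 days
filled = daily.asfreq("D", fill_value=0)
calendar_based = filled.rolling(window=2).mean()

print(row_based)
print(calendar_based)
```

The two versions disagree on 2010-01-05 because the row-based window pairs that day with 2010-01-02, while the calendar-based window pairs it with the empty day before it.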

# *Message by Senders (Top Sender)*

In [None]:
top_senders = df['from'].value_counts().head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=top_senders.values, hue=top_senders.values,y=top_senders.index, palette='Blues_d',legend=False)
plt.title("Top 10 Message Senders")
plt.xlabel("Number of Messages")
plt.ylabel("Sender")
plt.show()


***Observation:*** lydia.delgado@enron.com is the top sender with over 250 messages

# *Sentiment By Sender*

In [None]:
sender_sentiment = df.groupby(['from', 'sentiment']).size().unstack(fill_value=0)
top_sender_sentiment = sender_sentiment.loc[top_senders.index]

top_sender_sentiment.plot(kind='bar', stacked=True, figsize=(12,6), colormap='Pastel1')
plt.title("Sentiment Distribution by Top Senders")
plt.xlabel("Sender")
plt.ylabel("Message Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


**Observation** : For each of the top senders, most messages are neutral.

# *Message Length & Sentiment*

In [None]:
# Create a new column for message length
df['message_length'] = df['message'].apply(len)

plt.figure(figsize=(8,5))
sns.boxplot(x='sentiment', hue='sentiment',legend=False,y='message_length', data=df, palette='Set3')
plt.title("Message Length by Sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Message Length")
plt.show()


**Observation** : Positive messages tend to be the longest.
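
A box plot shows the comparison visually; a grouped summary gives the numbers behind it. The sketch below uses a handful of made-up messages, not the notebook's data, and reports the median and maximum length per sentiment.

```python
import pandas as pd

# Made-up messages for illustration only
toy = pd.DataFrame({
    "message": ["ok", "thanks a lot for the help", "this is late again",
                "noted", "great work team"],
    "sentiment": ["Neutral", "Positive", "Negative", "Neutral", "Positive"],
})
toy["message_length"] = toy["message"].str.len()

# Median is less sensitive to a few very long messages than the mean
length_summary = toy.groupby("sentiment")["message_length"].agg(["median", "max"])
print(length_summary)
```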

# Employee Score Calculation
Computing a monthly sentiment score for each employee based on their messages.

In [None]:
sentiment_score = {
    'Positive': 1,
    'Neutral': 0,
    'Negative': -1
}

df['sentiment_score'] = df['sentiment'].map(sentiment_score)


In [None]:
df['year_month'] = df['date'].dt.to_period('M')


In [None]:
monthly_scores = df.groupby(['from', 'year_month'])['sentiment_score'].sum().reset_index()


In [None]:
monthly_scores.rename(columns={'year_month': 'month', 'sentiment_score': 'monthly_sentiment_score'}, inplace=True)
monthly_scores['month'] = monthly_scores['month'].astype(str)  # convert Period to string if needed


In this step, we mapped sentiment labels (Positive, Neutral, Negative) to numeric scores (1, 0, -1) so that they can be aggregated.

We then extracted the year and month from the message date and summed the scores per sender and month to get a monthly sentiment score. The final `monthly_scores` table shows how sentiment fluctuates over time for each individual.
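
The aggregation can be traced on a small example. The sketch below uses made-up messages with placeholder addresses and follows the same steps: map labels to scores, derive a monthly period, then sum per sender and month.

```python
import pandas as pd

# Made-up messages; the addresses are placeholders
toy = pd.DataFrame({
    "from": ["a@example.com", "a@example.com", "a@example.com", "b@example.com"],
    "date": pd.to_datetime(["2010-01-03", "2010-01-20", "2010-02-01", "2010-01-15"]),
    "sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

toy["sentiment_score"] = toy["sentiment"].map({"Positive": 1, "Neutral": 0, "Negative": -1})
toy["year_month"] = toy["date"].dt.to_period("M")

toy_monthly = (
    toy.groupby(["from", "year_month"])["sentiment_score"].sum().reset_index()
)
print(toy_monthly)
```

Note that a monthly score of 0 is ambiguous: in this example one sender reaches it with a single neutral message and the other with one positive and one negative message that cancel out.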

In [None]:
monthly_scores

In [None]:
# Convert Period to string if not already
monthly_scores['month'] = monthly_scores['month'].astype(str)


In [None]:
# Create two lists to store ranking results
top_positive = []
top_negative = []

# Loop over each month
for month in monthly_scores['month'].unique():
    temp = monthly_scores[monthly_scores['month'] == month]

    # Sort for top positive (descending by score, then alphabetically)
    # .copy() avoids pandas' SettingWithCopyWarning when adding columns below
    top_pos = temp.sort_values(by=['monthly_sentiment_score', 'from'], ascending=[False, True]).head(3).copy()
    top_pos['rank_type'] = 'Top Positive'
    top_pos['rank'] = range(1, len(top_pos)+1)
    top_positive.append(top_pos)

    # Sort for top negative (ascending by score, then alphabetically)
    top_neg = temp.sort_values(by=['monthly_sentiment_score', 'from'], ascending=[True, True]).head(3).copy()
    top_neg['rank_type'] = 'Top Negative'
    top_neg['rank'] = range(1, len(top_neg)+1)
    top_negative.append(top_neg)


In [None]:
# Combine both rankings
ranking_df = pd.concat(top_positive + top_negative, ignore_index=True)

# Optional: clean column names
ranking_df = ranking_df[['month', 'from', 'monthly_sentiment_score', 'rank_type', 'rank']]
ranking_df.sort_values(by=['month', 'rank_type', 'rank'], inplace=True)


# Analysis By Month

In [None]:
from IPython.display import display

# Display as table
for month in ranking_df['month'].unique():
    display(ranking_df[ranking_df['month'] == month])


# Employee Ranking

In [None]:
# Group by employee (the 'from' column) and calculate average monthly sentiment score
employee_sentiment = monthly_scores.groupby('from')['monthly_sentiment_score'].mean()

# Sort to get top 3 positive and bottom 3 negative
top_positive = employee_sentiment.sort_values(ascending=False).head(3)
top_negative = employee_sentiment.sort_values().head(3)

print("Top 3 Most Positive Employees:")
print(top_positive)

print("\nTop 3 Most Negative Employees:")
print(top_negative)


# Flight Risk Identification
A flight risk is any employee who has sent 4 or more negative messages within any rolling 30-day window.

In [None]:
negative_df = df[df['sentiment'] == 'Negative'].copy()


In [None]:
negative_df['date'] = pd.to_datetime(negative_df['date'])


In [None]:
negative_df.sort_values(by=['from', 'date'], inplace=True)


In [None]:
# Set date as index for time-based rolling
# (rows were sorted by sender and date above, which rolling('30D') requires)
negative_df.set_index('date', inplace=True)

# Create a rolling count of negative messages per employee
negative_df['rolling_count'] = (
    negative_df
    .groupby('from')['sentiment']
    .rolling('30D')
    .count()
    .reset_index(level=0, drop=True)
)


Counting negative messages per employee

In [None]:
flight_risk_df = negative_df[negative_df['rolling_count'] >= 4].reset_index()


In [None]:
flight_risk_employees = flight_risk_df['from'].unique()
print("Flight Risk Employees:")
print(flight_risk_employees)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
sns.histplot(data=flight_risk_df, x='date', hue='from', multiple='stack', bins=30)
plt.title("Flight Risk Message Frequency by Employee Over Time")
plt.xlabel("Date")
plt.ylabel("Negative Messages Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


A rolling 30-day window per employee was used to count negative messages. If an employee sent 4 or more negative messages within any 30-day span, they were flagged as flight risk.
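
The rule can also be written without pandas, which makes the window logic explicit. The sketch below is a minimal pure-Python version, assuming the input is a list of `(sender, datetime)` pairs for negative messages only. The function `flag_flight_risks` and the sample addresses are hypothetical. A message exactly 30 days older than the current one is treated as outside the window, which matches the default right-closed window of `rolling('30D')`.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def flag_flight_risks(negative_messages, window_days=30, threshold=4):
    """Return senders with >= `threshold` negative messages inside any
    window of `window_days` days.

    `negative_messages` is an iterable of (sender, datetime) pairs.
    """
    window = timedelta(days=window_days)
    recent = defaultdict(deque)  # sender -> timestamps still inside the window
    flagged = set()
    for sender, when in sorted(negative_messages, key=lambda item: item[1]):
        stamps = recent[sender]
        stamps.append(when)
        # Drop timestamps that fell out of the window ending at `when`
        while when - stamps[0] >= window:
            stamps.popleft()
        if len(stamps) >= threshold:
            flagged.add(sender)
    return flagged

# Made-up negative messages for two placeholder senders
msgs = [
    ("a@example.com", datetime(2010, 1, 1)),
    ("a@example.com", datetime(2010, 1, 10)),
    ("a@example.com", datetime(2010, 1, 20)),
    ("a@example.com", datetime(2010, 1, 28)),
    ("b@example.com", datetime(2010, 1, 1)),
    ("b@example.com", datetime(2010, 2, 15)),
    ("b@example.com", datetime(2010, 3, 30)),
    ("b@example.com", datetime(2010, 5, 1)),
]
print(flag_flight_risks(msgs))
```

Both senders have four negative messages in total, but only the first sent them within 30 days, so only that sender is flagged.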

# Linear Regression Model

Developing a linear regression model to analyze sentiment trends and predict future sentiment scores.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

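model_df = monthly_scores.copy()

# The feature columns used below are not created earlier in this notebook.
# The definitions here are assumptions; adjust them if they were built differently.
model_df = model_df.rename(columns={'from': 'employee_id'})
month_start = pd.to_datetime(model_df['month'])  # 'YYYY-MM' strings -> timestamps
model_df['year'] = month_start.dt.year
model_df['month_num'] = month_start.dt.month
model_df = model_df.sort_values(['employee_id', 'year', 'month_num'])

# Lagged features per employee (previous available month, which may not be adjacent).
# score_change uses only past months; a change computed from the current month's
# score would leak the target into the features.
scores_by_employee = model_df.groupby('employee_id')['monthly_sentiment_score']
model_df['prev_sentiment_score'] = scores_by_employee.shift(1)
model_df['score_change'] = scores_by_employee.shift(1) - scores_by_employee.shift(2)
model_df = model_df.dropna(subset=['prev_sentiment_score', 'score_change'])

# Encoding categorical 'employee_id'
model_df['employee_id'] = LabelEncoder().fit_transform(model_df['employee_id'])

# Define features and target
features = ['year', 'month_num', 'prev_sentiment_score', 'score_change', 'employee_id']
X = model_df[features]
y = model_df['monthly_sentiment_score']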

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R² Score: {r2:.4f}")

# Feature importance (coefficients)
coeff_df = pd.DataFrame({
    'Feature': features,
    'Coefficient': model.coef_
})
print("\nFeature Importance (Coefficients):")
print(coeff_df)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a DataFrame for comparison
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
}).reset_index(drop=True)

# Plot actual vs predicted
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Actual', y='Predicted', data=results_df, color='dodgerblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # Diagonal line for perfect prediction
plt.title('Actual vs Predicted Monthly Sentiment Scores')
plt.xlabel('Actual Sentiment Score')
plt.ylabel('Predicted Sentiment Score')
plt.grid(True)
plt.tight_layout()
plt.show()


**Observation**: Actual vs Predicted Monthly Sentiment Scores

This scatter plot compares the actual and predicted monthly sentiment scores from the linear regression model.

The red dashed line represents the ideal case where the predicted score equals the actual score (perfect prediction).

Insight: The points lie very close to the red line, so predicted and actual values differ little. This should be read with care: the features include `prev_sentiment_score` and `score_change`, and if `score_change` is computed from the current month's score, the two features together determine the target exactly. In that case the close fit reflects target leakage rather than learned sentiment patterns.
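
One way to see why feature construction matters for this plot: the sketch below uses synthetic random scores, not the notebook's data, and fits ordinary least squares with NumPy. It assumes a `score_change` defined as the current score minus the previous score, which is only one possible definition. The helper `r_squared` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: independent random "previous" and "current" scores
prev_score = rng.integers(-5, 6, size=200).astype(float)
current_score = rng.integers(-5, 6, size=200).astype(float)

# Leaky feature: change computed from the current month's score
score_change = current_score - prev_score

def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    design = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    total = target - target.mean()
    return 1.0 - (residual @ residual) / (total @ total)

r2_leaky = r_squared(np.column_stack([prev_score, score_change]), current_score)
r2_honest = r_squared(prev_score.reshape(-1, 1), current_score)

print(f"R^2 with leaky feature:   {r2_leaky:.4f}")
print(f"R^2 with prev score only: {r2_honest:.4f}")
```

Even though the synthetic scores are unrelated from month to month, the leaky feature set reproduces the target almost exactly, while the previous score alone explains very little.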

## 🧾 Final Summary and Key Insights

This project aimed to analyze and predict employee sentiment from textual feedback, offering valuable insights into overall workplace morale. Here's a summary of the key steps, results, and potential next directions:

### 🔍 Objective
The goal was to map qualitative sentiment labels (Positive, Neutral, Negative) to quantitative sentiment scores, aggregate them monthly, and use a regression model to predict sentiment trends over time.

---

### 📊 Insights from Data

- **Data Cleaning:** Missing `Subject` and `body` values were replaced with empty strings before the two fields were combined into a single `message` column, making the data ready for time-series-based sentiment analysis.
- **Sentiment Mapping:** Sentiment labels were successfully mapped to numerical values (`Positive → 1`, `Neutral → 0`, `Negative → -1`) to facilitate quantitative analysis.
- **Monthly Trends:** Aggregated sentiment scores revealed fluctuating sentiment patterns over time, highlighting months with higher or lower employee satisfaction.

---

### 🤖 Model Performance

- A **linear regression model** was trained to predict monthly sentiment scores based on historical data.
- The scatter plot shows predicted values **closely tracking actual values** along the ideal `y = x` line.
- Before using the model for forecasting, confirm that every feature is available ahead of the month being predicted. If `score_change` is derived from the current month's score, the close fit reflects target leakage rather than forecasting ability.

---

### 💡 Interpretation of the Regression Plot

- The **scatter plot** comparing actual and predicted scores shows minimal deviation from the ideal line (`y = x`).
- A single actual-vs-predicted plot cannot by itself rule out overfitting or leakage; that requires comparing train and test errors and checking how each feature is constructed.
- If the fit holds up under those checks, it is promising for tracking how employee sentiment might evolve month-over-month.

---

### 📌 Conclusion

This analysis demonstrates that even simple linear models can provide meaningful insights into organizational sentiment when paired with proper data aggregation and transformation. The results support the feasibility of using predictive analytics to inform HR strategies, employee engagement programs, and workplace improvements.

