<a href="https://colab.research.google.com/github/saurav-dhait/AI_LAB/blob/main/NLP_SA_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
suchintikasarkar_sentiment_analysis_for_mental_health_path = kagglehub.dataset_download('suchintikasarkar/sentiment-analysis-for-mental-health')

print('Data source import complete.')


# IMPORT LIBRARIES

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# IMPORT DATASET

In [None]:
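import os
# Build the path from the kagglehub download directory so the cell also works outside Kaggle
path = os.path.join(suchintikasarkar_sentiment_analysis_for_mental_health_path, 'Combined Data.csv')
dataset = pd.read_csv(path)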

In [None]:
print(dataset.head())

# EXPLORATORY DATA ANALYSIS
This section covers:
* Understanding the dataset's structure and missing values.
* Visualizing the distribution of mental health statuses.
* Preparing the text data (by handling missing values and analyzing the length of the text) for future sentiment analysis or machine learning tasks.

**Checking the basic structure of the dataset using `dataset.info()`.** This reports the columns, their data types, and the number of non-null entries in each column. It shows the overall layout of the dataset: how many entries it has and what types of data are present (e.g., strings, integers).

In [None]:
print("Dataset Info:")
# info() prints its report directly and returns None, so it is not wrapped in print()
dataset.info()

**Analysis of the output above**

1. Total entries (rows): there are 53,043 entries in the dataset, indexed from 0 to 53,042.
2. Columns: the dataset contains 3 columns.
   * `Unnamed: 0`: an integer column (int64) with no missing values. It looks like a leftover index column or an irrelevant placeholder, possibly written out when the CSV was saved.
   * `statement`: text data (object dtype) with 52,681 non-null entries, meaning 362 entries are missing (53,043 - 52,681 = 362).
   * `status`: string/object data representing the mental health status, with no missing values.
3. Memory usage: the dataset takes up approximately 1.2 MB according to `info()`. By default this figure does not count the contents of the strings, so the real footprint is larger.
4. What you can infer:
   * `statement` contains missing values that must be handled before further analysis (this notebook fills them with empty strings).
   * `Unnamed: 0` may be unnecessary, and you can consider dropping it.
   * `status` is complete, so it is ready for analysis.

**In summary, handle the missing values in `statement` and check whether `Unnamed: 0` serves any purpose.**
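
The clean-up suggested above can be sketched on a toy frame. This is a minimal illustration, not a step the notebook performs: the frame `toy` and its values are made up, and only the column names follow the `info()` output.

```python
import pandas as pd

# Toy frame mimicking the layout described above (values are made up)
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "statement": ["i feel fine", None, "cannot sleep"],
    "status": ["Normal", "Anxiety", "Insomnia"],
})

# Drop the leftover index column; errors="ignore" keeps this safe if it is absent
toy = toy.drop(columns=["Unnamed: 0"], errors="ignore")

# Count the missing statements that still need handling
n_missing = int(toy["statement"].isna().sum())
print(toy.columns.tolist(), n_missing)
```

The same `drop` call could be applied to `dataset` if the column turns out to be unused.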

**Checking for missing values in each column using `dataset.isnull().sum()`.** This identifies the columns that need imputation or removal of null values before further analysis.

In [None]:
print("Missing Values:")
print(dataset.isnull().sum())

**Plotting a histogram of the 'status' column** (presumably the mental health status of the individuals in the dataset). Because `status` is categorical, the plot is a count per category. It shows how the statuses are spread across the dataset, which can reveal class imbalance.

In [None]:
import plotly.express as px
fig = px.histogram(dataset, x='status', title='Distribution of Mental Health Status')
fig.show()


**Filling missing values in the 'statement' column with empty strings (`''`).** This removes all null values from 'statement', so the text-processing steps below do not fail on NaN entries.

In [None]:
dataset['statement'] = dataset['statement'].fillna('')


In [None]:
print("Missing Values:")
print(dataset.isnull().sum())
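
Filling with empty strings keeps every row, but those rows then carry a label with no text. One alternative, which this notebook does not use, is to drop them. The sketch below contrasts the two options on a made-up frame (`toy`, `filled`, and `dropped` are illustrative names).

```python
import pandas as pd

# Made-up data with one missing statement
toy = pd.DataFrame({
    "statement": ["i feel fine", None, "cannot sleep"],
    "status": ["Normal", "Anxiety", "Insomnia"],
})

# Option 1 (used in this notebook): keep the row, replace NaN with ''
filled = toy.assign(statement=toy["statement"].fillna(""))

# Option 2 (alternative): remove rows that have no statement
dropped = toy.dropna(subset=["statement"]).reset_index(drop=True)

print(len(filled), len(dropped))
```

Which option is preferable depends on whether an empty statement is considered informative for the task.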

**Calculating the length of each entry in the 'statement' column** by splitting the text on whitespace and counting the resulting words. The count is stored in a new column called 'text_length'.

In [None]:
dataset['text_length'] = dataset['statement'].apply(lambda x: len(str(x).split()))

**Relating text length to mental health status.** A scatter plot of 'text_length' against 'status' shows how statement length varies across the categories.

In [None]:
fig = px.scatter(dataset, x='status', y='text_length', title='Text Length vs Mental Health Status')
fig.show()
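
With tens of thousands of points per category a scatter plot can be hard to read, so a numeric summary per status may help. The sketch below is an assumption about how one might do this with `groupby`. It uses made-up statements and labels rather than the real dataset.

```python
import pandas as pd

# Made-up statements and labels, for illustration only
toy = pd.DataFrame({
    "statement": [
        "i feel fine",
        "cannot sleep at all lately",
        "ok",
        "worried about everything again",
    ],
    "status": ["Normal", "Insomnia", "Normal", "Anxiety"],
})

# Word count per statement, same idea as the 'text_length' column above
toy["text_length"] = toy["statement"].str.split().str.len()

# Count, median and maximum length per status
summary = toy.groupby("status")["text_length"].agg(["count", "median", "max"])
print(summary)
```

Applied to `dataset`, the same `groupby` would give one row per mental health status.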


In [None]:
# Standard library
import re
import string

# NLP utilities
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
from wordcloud import WordCloud

# Modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Plotting (numpy, matplotlib.pyplot and plotly.express are already imported above)
import plotly.figure_factory as ff
import plotly.graph_objects as go

# Splitting the data

In [None]:
X = dataset['statement']
y = dataset['status']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
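
The split above is purely random. If the status categories are imbalanced, a stratified split keeps the class proportions the same in the training and test sets. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels. It is a possible variation, not what this notebook does.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 16 documents of class "A", 4 of class "B"
texts = [f"doc {i}" for i in range(20)]
labels = ["A"] * 16 + ["B"] * 4

# stratify=labels preserves the 4:1 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

For the real data the equivalent call would pass `stratify=y`.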

# Vectorization

In [None]:
vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Model Training with Hyperparameter Tuning

In [None]:
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100]
}

# max_iter=10 stops the solver long before convergence; allow more iterations
model = LogisticRegression(max_iter=1000)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_tfidf, y_train)

# Best Model
best_model = grid_search.best_estimator_
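
In the cell above the TF-IDF vocabulary and idf weights are learned from the whole training set before cross-validation, so each validation fold has already contributed to them. A common way to avoid this is to put the vectorizer and the classifier in one `Pipeline` and tune that. The sketch below is an assumption about how this could look, using made-up sentences and labels, not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up two-class data, six documents per class
texts = [
    "i feel worried all the time", "my heart races and i worry",
    "worried about work again", "i worry about everything",
    "constant worry keeps me tense", "so worried i cannot relax",
    "i feel fine today", "had a good relaxing day",
    "today was a fine day", "feeling good and rested",
    "a fine and quiet evening", "good day with friends",
]
labels = ["worry"] * 6 + ["calm"] * 6

# The vectorizer is re-fitted inside every cross-validation fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3, scoring="accuracy")
grid.fit(texts, labels)
print(grid.best_params_)
```

With this layout `grid.best_estimator_` accepts raw text directly, so no separate `transform` step is needed at prediction time.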

In [None]:
y_pred = best_model.predict(X_test_tfidf)

In [None]:
print("Best Parameters:")
print(grid_search.best_params_)

print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

print("Classification Report:")
print(classification_report(y_test, y_pred))
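
The classification report gives precision and recall per class, but not which statuses are confused with each other. `confusion_matrix` is imported above and never called. The sketch below shows one way to use it, on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative names).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true_demo = ["A", "A", "B", "B", "C", "C"]
y_pred_demo = ["A", "B", "B", "B", "C", "A"]
class_names = ["A", "B", "C"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)

# Label the axes so the table is readable
cm_df = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in class_names],
    columns=[f"pred {c}" for c in class_names],
)
print(cm_df)
```

For the model above, the call would be `confusion_matrix(y_test, y_pred, labels=best_model.classes_)`.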

In [None]:
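# Feature Importance: the largest positive coefficients for each class
feature_names = vectorizer.get_feature_names_out()
coefs = best_model.coef_  # one row of coefficients per class in the multiclass case
for i, category in enumerate(best_model.classes_):
    # argsort is ascending, so the last 10 indices are the 10 largest coefficients
    top_features = coefs[i].argsort()[-10:]
    top_words = [feature_names[j] for j in top_features]
    top_scores = [coefs[i][j] for j in top_features]
    # One bar chart per status category
    fig = go.Figure([go.Bar(x=top_words, y=top_scores)])
    fig.update_layout(title=f'Top Features for {category}')
    fig.show()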

In [None]:
# Word Cloud
all_text = ' '.join(dataset['statement'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Statements')  # the statements are not cleaned beyond filling NaN
plt.show()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw documents: pass the text column, not the whole DataFrame
# (iterating over a DataFrame yields its column names, not its rows)
docs = dataset['statement']

# Use a separate vectorizer so the one fitted for the model is not overwritten
demo_vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',   # optionally remove common stop words
    norm='l2',              # length normalization
    smooth_idf=True         # adds 1 to idf numerator/denominator
)

# Learn vocabulary and idf, then transform docs to a sparse TF-IDF matrix
tfidf_matrix = demo_vectorizer.fit_transform(docs)

# Retrieve feature names (terms)
terms = demo_vectorizer.get_feature_names_out()

# Densify only the first few rows; converting the full matrix may not fit in memory
df_tfidf = pd.DataFrame(tfidf_matrix[:5].toarray(), columns=terms)
print(df_tfidf)
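
The `smooth_idf=True` comment above can be checked numerically. With smoothing, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below verifies this for one term on three made-up documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents; "sleep" appears in two of them
docs_demo = ["sleep well", "cannot sleep", "feel fine"]

vec = TfidfVectorizer(smooth_idf=True)
vec.fit(docs_demo)

n_docs = len(docs_demo)
df_sleep = 2  # document frequency of "sleep", counted by hand

# Smoothed idf as documented for TfidfTransformer
expected_idf = np.log((1 + n_docs) / (1 + df_sleep)) + 1

# idf learned by the vectorizer for the same term
learned_idf = vec.idf_[vec.vocabulary_["sleep"]]
print(expected_idf, learned_idf)
```

The two printed values should agree, which confirms how the smoothing enters the weight.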


In [None]:
import plotly.express as px
fig = px.pie(dataset, names='status', title='Proportion of Each Status Category')
fig.show()