<a href="https://colab.research.google.com/github/tarakantaacharya/Stock_Movement_Analysis/blob/main/Data_Preprocessing_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing and Cleaning

#### Instructions:
Run the cells in order; no extra configuration is needed.

Note: Install the packages listed in requirements.txt before running.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [2]:
# Install NLTK
!pip install nltk



In [None]:
# Importing pandas to work with data in DataFrame format (useful for data manipulation and analysis)
import pandas as pd
# Importing the 're' module for regular expressions (used for text cleaning and pattern matching)
import re
# Importing the 'string' module to access common string operations (e.g., removing punctuation)
import string
# Importing stopwords from NLTK (Natural Language Toolkit) to remove common words (like 'the', 'and') that add little meaning
from nltk.corpus import stopwords
# Importing word_tokenize from NLTK to split text into individual words (tokens)
from nltk.tokenize import word_tokenize
# Importing WordNetLemmatizer from NLTK to reduce words to their base or root form (lemmatization)
from nltk.stem import WordNetLemmatizer
# Importing TextBlob for sentiment analysis and text processing (it provides tools to work with text, like sentiment polarity)
from textblob import TextBlob
# Importing nltk to download and work with NLTK resources (e.g., downloading stopwords and tokenizers)
import nltk

# Purpose of this Script:
This code performs data preprocessing, cleaning, and sentiment analysis on Reddit posts related to stock data. It transforms raw text data into a structured format, making it easier to analyze or use in predictive models.

In [None]:
# Download NLTK resources:
# These downloads ensure that the data files required for tokenization, stopword removal, and lemmatization are available.
nltk.download('punkt')      # Tokenizer models
nltk.download('stopwords')  # List of stopwords in different languages
nltk.download('wordnet')    # WordNet lemmatizer data
nltk.download('punkt_tab')  # Tokenizer tables (recent NLTK releases need these for word_tokenize)

# Load the dataset (the scraped data file from the previous step)
df_posts = pd.read_csv('reddit_stock_data_posts.csv')

# Initialize NLP tools:
# Sets up the lemmatizer and the English stopword list for text preprocessing.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Text Cleaning Function:
# Converts text to lowercase for consistency.
# Removes URLs, mentions, non-alphabetic characters, and extra spaces.
# Tokenizes the text into individual words.
# Removes stopwords (common words like "the", "and" that don't add much meaning).
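# Applies lemmatization to convert words to their base form (e.g., "times" becomes "time").
# Note: lemmatize() treats words as nouns by default, so verb forms such as "running" are left unchanged.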
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove URLs, mentions, and other unwanted characters
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)  # Remove URLs
    text = re.sub(r'@\S+', '', text)  # Remove mentions
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters

    # Remove extra whitespace and newlines
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize, then remove stopwords
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]

    # Lemmatization (converting words to their base form)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Join tokens back to a string
    return ' '.join(tokens)

# Clean a text column:
# Applies clean_text to every string value; non-string values (such as NaN) become empty strings.
def clean_column(column):
    return column.apply(lambda x: clean_text(x) if isinstance(x, str) else '')

# Apply text cleaning to the title and content columns
df_posts['cleaned_title'] = clean_column(df_posts['title'])
df_posts['cleaned_content'] = clean_column(df_posts['content'])

# Function to calculate sentiment polarity (a numeric score, classified further below)
def get_sentiment(text):
    blob = TextBlob(text)
    # Return sentiment polarity: -1 (negative) to 1 (positive)
    return blob.sentiment.polarity

# Apply sentiment analysis to the cleaned titles and content
df_posts['title_sentiment'] = df_posts['cleaned_title'].apply(get_sentiment)
df_posts['content_sentiment'] = df_posts['cleaned_content'].apply(get_sentiment)

# Function to classify sentiment as Positive, Negative, Neutral based on polarity
def classify_sentiment(polarity):
    if polarity > 0.1:
        return 'Positive'
    elif polarity < -0.1:
        return 'Negative'
    else:
        return 'Neutral'

# Apply sentiment classification
df_posts['title_sentiment_class'] = df_posts['title_sentiment'].apply(classify_sentiment)
df_posts['content_sentiment_class'] = df_posts['content_sentiment'].apply(classify_sentiment)

# Save the cleaned and processed data
df_posts.to_csv('reddit_stock_data_posts_cleaned.csv', index=False)

print("Data preprocessing and cleaning complete. Saved")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Data preprocessing and cleaning complete. Saved
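
To make the regex steps in `clean_text` easier to follow, here is a minimal sketch that applies only those substitutions, leaving out the NLTK tokenization, stopword removal, and lemmatization. The helper name `regex_clean` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re

def regex_clean(text):
    """Apply only the regex steps of clean_text (no NLTK steps)."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)  # remove URLs
    text = re.sub(r'@\S+', '', text)                     # remove mentions
    text = re.sub(r'[^a-zA-Z\s]', '', text)              # remove non-letters
    return re.sub(r'\s+', ' ', text).strip()             # collapse whitespace

# Made-up sample post title (assumption for illustration)
sample = "GME YOLO month-end update, Jan 2021: https://example.com @someuser"
print(regex_clean(sample))  # gme yolo monthend update jan
```

Because non-letters are deleted rather than replaced with a space, hyphenated words are merged ("month-end" becomes "monthend") and digits disappear entirely.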


In [None]:
fd1 = pd.read_csv('reddit_stock_data_posts_cleaned.csv')  # Load the processed data back from the CSV file
fd1.head()                                                # Show the first 5 rows of the DataFrame

Unnamed: 0,subreddit,title,content,score,num_comments,url,created_utc,upvote_ratio,author,cleaned_title,cleaned_content,title_sentiment,content_sentiment,title_sentiment_class,content_sentiment_class
0,WallStreetBets,Times Square right now,,489193,14013,https://v.redd.it/x64z70f7eie61,2021-01-30 18:00:38,0.99,SomeGuyInDeutschland,time square right,,0.285714,0.0,Positive,Neutral
1,WallStreetBets,UPVOTE so everyone sees we got SUPPORT,,338563,12843,https://i.redd.it/sgoqy8nyt2e61.png,2021-01-28 13:40:34,0.98,vrweensy,upvote everyone see got support,,0.0,0.0,Neutral,Neutral
2,WallStreetBets,GME YOLO update — Jan 28 2021,,300871,23007,https://i.redd.it/opzucppb15e61.png,2021-01-28 21:06:23,0.98,DeepFuckingValue,gme yolo update jan,,0.0,0.0,Neutral,Neutral
3,WallStreetBets,GME YOLO month-end update — Jan 2021,,264904,19896,https://i.redd.it/r557em3t5ce61.png,2021-01-29 21:04:45,0.98,DeepFuckingValue,gme yolo monthend update jan,,0.0,0.0,Neutral,Neutral
4,WallStreetBets,It’s treason then,,247637,4596,https://i.redd.it/d3t66lv1yce61.jpg,2021-01-29 23:40:59,0.98,keenfeed,treason,,0.0,0.0,Neutral,Neutral
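
The sentiment classes in this preview follow from the ±0.1 thresholds in `classify_sentiment`. As a small sketch, the function is repeated here and run on the two polarity values visible in the preview plus a few boundary values chosen for illustration:

```python
def classify_sentiment(polarity):
    """Same thresholds as the notebook's classify_sentiment."""
    if polarity > 0.1:
        return 'Positive'
    elif polarity < -0.1:
        return 'Negative'
    return 'Neutral'

# 0.285714 and 0.0 appear in the preview; the other values are illustrative.
for p in (0.285714, 0.0, 0.1, -0.1, -0.5):
    print(p, classify_sentiment(p))
```

Both comparisons are strict, so a polarity of exactly 0.1 or -0.1 is classified as Neutral.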


In [None]:
fd1.shape   # Returns the shape of processed dataframe

(16947, 15)

The DataFrame has 16947 rows and 15 columns: the 9 original columns plus 6 new ones added during preprocessing ('cleaned_title', 'cleaned_content', 'title_sentiment', 'content_sentiment', 'title_sentiment_class', 'content_sentiment_class').
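
The column count can be checked without loading any data. In this sketch the column names are copied from the `fd1.columns` output in this notebook, and the list of added columns comes from the preprocessing cell; the variable names are illustrative.

```python
# Column names copied from the fd1.columns output in this notebook.
all_columns = [
    'subreddit', 'title', 'content', 'score', 'num_comments', 'url',
    'created_utc', 'upvote_ratio', 'author', 'cleaned_title',
    'cleaned_content', 'title_sentiment', 'content_sentiment',
    'title_sentiment_class', 'content_sentiment_class',
]

# Columns created by the preprocessing cell.
added = [
    'cleaned_title', 'cleaned_content', 'title_sentiment',
    'content_sentiment', 'title_sentiment_class', 'content_sentiment_class',
]

original = [c for c in all_columns if c not in added]
print(len(original), len(added), len(all_columns))  # 9 6 15
```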

In [None]:
fd1.columns   # columns of dataframe

Index(['subreddit', 'title', 'content', 'score', 'num_comments', 'url',
       'created_utc', 'upvote_ratio', 'author', 'cleaned_title',
       'cleaned_content', 'title_sentiment', 'content_sentiment',
       'title_sentiment_class', 'content_sentiment_class'],
      dtype='object')

In [None]:
# Count NaN values in each column
nan_count_per_column = fd1.isna().sum()

# Display the result
print("NaN values per column:")
nan_count_per_column

NaN values per column:


Unnamed: 0,0
subreddit,0
title,0
content,6821
score,0
num_comments,0
url,0
created_utc,0
upvote_ratio,0
author,1283
cleaned_title,42
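
A likely reason `cleaned_title` contains NaN values even though `clean_column` returns empty strings for missing text: an empty string is written to the CSV as an empty field, and `pd.read_csv` reads empty fields back as NaN by default. The sketch below reproduces this on a made-up two-row frame (the values are assumptions for illustration), using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Made-up frame: the first title cleans down to an empty string.
df = pd.DataFrame({
    'title': ['It is the', 'GME update'],
    'cleaned_title': ['', 'gme update'],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded['cleaned_title'].isna().sum())  # 1

# With keep_default_na=False the empty field stays an empty string.
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
print(kept['cleaned_title'].isna().sum())  # 0
```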


We load the scraped data into `scrapped_data` and the processed data into `cleaned_data` for further use.

In [None]:
scrapped_data = pd.read_csv('reddit_stock_data_posts.csv')
cleaned_data = pd.read_csv('reddit_stock_data_posts_cleaned.csv')

The columns "content", "author", "cleaned_title", and "cleaned_content" contain NaN values, so we drop the rows that contain them.

In [None]:
fd1.dropna(inplace=True)   # Drop every row that has a NaN value in any column
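
Called without arguments, `dropna` removes a row if any of its columns is NaN. The sketch below shows this on a made-up three-row frame (all values are assumptions for illustration). It also shows the `subset` parameter, which restricts the check to chosen columns; that is an alternative, not what this notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'content': [np.nan, 'text', 'more text'],
    'author': ['u1', np.nan, 'u3'],
})

# Default: a NaN in any column removes the row.
dropped_any = toy.dropna()
print(len(dropped_any))  # 1

# Alternative: only check the 'content' column.
dropped_subset = toy.dropna(subset=['content'])
print(len(dropped_subset))  # 2
```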

In [None]:
# Count NaN values in each column
nan_count_per_column = fd1.isna().sum()

# Display the result
print("NaN values per column:")
nan_count_per_column

NaN values per column:


Unnamed: 0,0
subreddit,0
title,0
content,0
score,0
num_comments,0
url,0
created_utc,0
upvote_ratio,0
author,0
cleaned_title,0


No NaN values remain in the listed columns.

In [None]:
fd1.shape   # Shape of the cleaned DataFrame

(9131, 15)

Of the original 16947 rows, 9131 remain in the DataFrame; the other 7816 were dropped because they contained NaN values.

The data is now cleaned and preprocessed. The next step is feature extraction, where we extract the features that matter most for model performance.