<a href="https://colab.research.google.com/github/tarakantaacharya/Stock_Movement_Analysis/blob/main/Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Feature Engineering

####Instructions:
1. Install the dependencies listed in the requirements.txt file first; the later steps rely on them.
2. Before running this notebook, run the Data Scrapping and Data_PreProcessing_Cleaning notebooks to produce the cleaned data.


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

####Purpose of this Script:
This code performs feature engineering on preprocessed Reddit post data. It creates new features from text data, such as word counts and sentiment metrics, to prepare the dataset for machine learning (ML) models.

In [29]:
#Importing NumPy:
#NumPy is imported to perform numerical operations, such as calculating averages.
import numpy as np

# Creating Text-Based Features:
# These features help quantify the text content in a structured format for ML models.

# 1. Word Count: Calculates the number of words in the cleaned title and content.
fd1['title_word_count'] = fd1['cleaned_title'].apply(lambda x: len(x.split()))
fd1['content_word_count'] = fd1['cleaned_content'].apply(lambda x: len(x.split()))

# 2. Character Count: Calculates the total number of characters in the title and content (excluding spaces).
fd1['title_char_count'] = fd1['cleaned_title'].apply(lambda x: len(x.replace(" ", "")))
fd1['content_char_count'] = fd1['cleaned_content'].apply(lambda x: len(x.replace(" ", "")))

# 3. Average Word Length: Computes the average length of words in the title and content. This gives insight into the complexity of the text.
fd1['title_avg_word_length'] = fd1['cleaned_title'].apply(lambda x: np.mean([len(word) for word in x.split()]) if len(x.split()) > 0 else 0)
fd1['content_avg_word_length'] = fd1['cleaned_content'].apply(lambda x: np.mean([len(word) for word in x.split()]) if len(x.split()) > 0 else 0)

# 4. Sentiment-Based Features: These features leverage the sentiment analysis performed earlier.

# **** Sentiment Scores: Uses the already-calculated sentiment polarity scores.
fd1['title_sentiment_score'] = fd1['title_sentiment']
fd1['content_sentiment_score'] = fd1['content_sentiment']

# **** Numeric Sentiment Classification: Converts sentiment labels into numeric values for ML algorithms (Positive = 1, Neutral = 0, Negative = -1).
def sentiment_to_numeric(sentiment):
    if sentiment == 'Positive':
        return 1
    elif sentiment == 'Negative':
        return -1
    else:
        return 0

fd1['title_sentiment_numeric'] = fd1['title_sentiment_class'].apply(sentiment_to_numeric)
fd1['content_sentiment_numeric'] = fd1['content_sentiment_class'].apply(sentiment_to_numeric)

# 5.Creating the Final Feature Set: This step selects the most important features to build the feature matrix X, which will be used as input for ML models.
feature_columns = [
    'title_word_count', 'content_word_count',
    'title_char_count', 'content_char_count',
    'title_avg_word_length', 'content_avg_word_length',
    'title_sentiment_score', 'content_sentiment_score',
    'title_sentiment_numeric', 'content_sentiment_numeric'
]

# Create the final feature matrix
X = fd1[feature_columns]

print("Feature Engineering and Data Preprocessing complete.")

Feature Engineering and Data Preprocessing complete.
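
The text features above can also be wrapped in a small helper so that the title and content columns share one code path. The following is a minimal sketch on a toy DataFrame, not the notebook's own code: the function name `add_text_features` and the toy data are hypothetical, and unlike the cell above it treats missing text as empty and excludes all whitespace (not only spaces) from the character count.

```python
import numpy as np
import pandas as pd


def add_text_features(df: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
    """Return a copy of df with word count, character count, and average word length."""
    out = df.copy()
    # Treat missing text as an empty string so splitting never fails
    words = out[column].fillna("").astype(str).str.split()
    out[f"{prefix}_word_count"] = words.apply(len)
    # Characters excluding whitespace: the sum of the word lengths
    out[f"{prefix}_char_count"] = words.apply(lambda ws: sum(len(w) for w in ws))
    # Average word length, defined as 0.0 for empty text
    out[f"{prefix}_avg_word_length"] = words.apply(
        lambda ws: float(np.mean([len(w) for w in ws])) if ws else 0.0
    )
    return out


# Toy data (hypothetical), including an empty and a missing title
toy = pd.DataFrame({"cleaned_title": ["gme to the moon", "", None]})
toy = add_text_features(toy, "cleaned_title", "title")
print(toy)
```

Calling the helper once per text column (for example with prefixes `title` and `content`) would yield the same column names used in the feature list above.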


####Key Benefits of Feature Engineering:
######Improves Model Accuracy:
Adding relevant features (such as word count and sentiment) can improve the predictive power of ML models when those features carry signal about the target.
######Captures Text Characteristics:
Text-based features capture the length, complexity, and emotional tone of Reddit posts.
######Quantifies Sentiment:
Converting sentiment into numeric values enables ML algorithms to understand and process the emotional context.

These engineered features make it easier for machine learning models to detect patterns and trends, potentially improving their ability to forecast stock price movements based on Reddit discussions.
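
As an illustration of the sentiment-to-numeric conversion, the sketch below uses a dictionary lookup with `Series.map` instead of a Python function. It is an alternative under stated assumptions, not the notebook's method: the constant name `SENTIMENT_TO_NUMERIC` and the toy labels are hypothetical. One possible advantage is that labels that do not match any key (for example a lowercase `positive`) surface as NaN and can be inspected before they are folded into the neutral value.

```python
import pandas as pd

# Mapping used in this notebook: Positive = 1, Neutral = 0, Negative = -1
SENTIMENT_TO_NUMERIC = {"Positive": 1, "Neutral": 0, "Negative": -1}

# Toy labels (hypothetical), including an unexpected spelling and a missing value
labels = pd.Series(["Positive", "Negative", "Neutral", "positive", None])

# Labels without a matching key become NaN
numeric = labels.map(SENTIMENT_TO_NUMERIC)

# Inspect what failed to map before deciding how to treat it
unmapped = labels[numeric.isna()]
print("Unmapped labels:", unmapped.tolist())

# Same fallback as sentiment_to_numeric: anything else counts as neutral (0)
numeric_filled = numeric.fillna(0).astype(int)
print(numeric_filled.tolist())
```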

In [30]:
fd1.columns    #The new dataframe with additional columns

Index(['subreddit', 'title', 'content', 'score', 'num_comments', 'url',
       'created_utc', 'upvote_ratio', 'author', 'cleaned_title',
       'cleaned_content', 'title_sentiment', 'content_sentiment',
       'title_sentiment_class', 'content_sentiment_class', 'title_word_count',
       'content_word_count', 'title_char_count', 'content_char_count',
       'title_avg_word_length', 'content_avg_word_length',
       'title_sentiment_score', 'content_sentiment_score',
       'title_sentiment_numeric', 'content_sentiment_numeric'],
      dtype='object')

New columns are: 'title_word_count',
       'content_word_count', 'title_char_count', 'content_char_count',
       'title_avg_word_length', 'content_avg_word_length',
       'title_sentiment_score', 'content_sentiment_score',
       'title_sentiment_numeric', 'content_sentiment_numeric'

In [31]:
# Convert the 'created_utc' column to datetime format
fd1['time_date'] = pd.to_datetime(fd1['created_utc'])
# Extract only the date (year, month, day) from the datetime object
fd1['date'] = fd1['time_date'].dt.date

In [33]:
fd1[['time_date','date']].head()   #Gives the top 5 data rows

Unnamed: 0,time_date,date
5,2021-01-28 13:49:11,2021-01-28
28,2021-02-02 14:35:23,2021-02-02
48,2021-01-28 16:30:30,2021-01-28
65,2021-01-28 00:36:02,2021-01-28
91,2021-02-05 03:32:31,2021-02-05


In [35]:
# Convert the 'date' column to datetime format
fd1['date'] = pd.to_datetime(fd1['date'])

In [36]:
fd1['date'].dtype  # Identifying the datatype of date column

dtype('<M8[ns]')

####Why We Use the yfinance Dataset:
######1.Aligning Date Ranges (Reddit and Stock Data):

We extract the minimum and maximum dates from the Reddit data to define the range of interest for the stock data.
By downloading stock data within this same date range, we ensure that we're analyzing stock performance during the exact time frame when Reddit discussions (which we are analyzing for sentiment, trends, etc.) were happening.
This is crucial because we want to correlate Reddit post sentiment or discussions with the stock price movements. If the stock data is not aligned with the date range of the Reddit data, we can't properly compare them.
######2.Merging Data:

Once the stock data is downloaded and aligned with Reddit's date range, we can merge the datasets based on the date column.
This ensures that each Reddit post (or sentiment analysis result) has a corresponding stock data point (price, volume, etc.) on the same day.
Without aligning the dates, the data would not match up, and any analysis of how Reddit discussions impact stock prices would be invalid.
######3.Date Preprocessing (Ensuring Consistency):

We convert both Reddit and stock data dates into the same format (e.g., removing time from the datetime object), ensuring consistency in the merging process.
This prevents mismatches during the merge, as even small discrepancies in the date format can result in errors or incorrect data merging.
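
The date-consistency step can be sketched on toy data as follows. This is an illustrative alternative, assuming timezone-aware stock dates like the ones shown later in this notebook: it uses `.dt.normalize()` to set the time to midnight while keeping a datetime dtype, so no second conversion back to datetime is needed. The frame names `reddit` and `stock` are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical): Reddit timestamps with a time of day,
# stock dates that are timezone-aware (UTC)
reddit = pd.DataFrame({"created_utc": ["2021-01-28 13:49:11", "2021-02-02 14:35:23"]})
stock = pd.DataFrame({"Date": pd.to_datetime(["2021-01-28", "2021-02-02"], utc=True)})

# Reddit side: parse, then set the time component to midnight
reddit["date"] = pd.to_datetime(reddit["created_utc"]).dt.normalize()

# Stock side: drop the timezone first so both key columns share one dtype
stock["date"] = stock["Date"].dt.tz_localize(None).dt.normalize()

print(reddit["date"].dtype, stock["date"].dtype)
print(reddit["date"].tolist() == stock["date"].tolist())
```

Merging a timezone-aware column with a timezone-naive one is a common source of merge errors, which is why the timezone is removed before normalizing.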



#####1.Importing yfinance:
The yfinance library is used to download stock data from Yahoo Finance. It allows access to historical stock prices, volume, dividends, and more.

#####2.Setting Date Range Based on Reddit Data:

min_date and max_date are derived from the Reddit dataset (fd1['date']), which determines the period we will download stock data for.
These dates are used to filter the stock data so that we only get data within the range of posts.
#####3.Downloading Stock Data:
The yf.download() function downloads historical stock data for a specific stock (in this case, 'AAPL' for Apple) from Yahoo Finance, using the date range defined earlier.

The start and end parameters ensure the data is limited to the same period as the Reddit posts.
#####4.Resetting the Index:
Stock data from yfinance is indexed by Date, and its columns form a two-level MultiIndex (Price, Ticker), as the preview output further down shows. The reset_index() method converts the Date index into a regular column.
This makes the DataFrame easier to work with, as 'Date' becomes a regular column instead of an index.

#####5.Ensuring 'Date' is in the Correct Format:
After resetting the index, a new 'date' column is created by converting the 'Date' column to a datetime object and then to a date-only format using .dt.date. This removes any time components (like hours, minutes, and seconds) and ensures consistency with the date format in the Reddit data.

#####6.Verifying Date Range for Stock Data:
We print the minimum and maximum dates for the stock data to verify that the date range matches the one from the Reddit dataset.

#####7.Previewing Stock Data:
stock_data.head() shows the first few rows of the stock data so we can check the structure of the data, including columns like Date, Open, High, Low, Close, etc.
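
One detail worth checking when setting the date range: yfinance treats the `end` argument as exclusive, which is consistent with the stock data printed below ending one day before the last Reddit date. A minimal sketch of building a window that includes the last Reddit date is shown here; the helper name `download_window` is hypothetical and the download call is left commented out because it needs network access.

```python
import pandas as pd


def download_window(dates: pd.Series) -> tuple[str, str]:
    """Return (start, end) strings where end is one day past the latest date."""
    start = dates.min().normalize()
    # Add one day because the download's end date is treated as exclusive
    end_exclusive = dates.max().normalize() + pd.Timedelta(days=1)
    return start.strftime("%Y-%m-%d"), end_exclusive.strftime("%Y-%m-%d")


# Toy stand-in for fd1['date'], using the range printed in this notebook
reddit_dates = pd.Series(pd.to_datetime(["2009-10-20", "2024-12-05"]))
start, end = download_window(reddit_dates)
print(start, end)

# With network access, the window would be passed on like this:
# stock_data = yf.download("AAPL", start=start, end=end)
```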

In [37]:
# Import the yfinance library to download stock data from Yahoo Finance
import yfinance as yf

# Get the minimum and maximum dates from the Reddit data to define the range for stock data
min_date = fd1['date'].min().date()  # Get the earliest date in the 'date' column of Reddit data
max_date = fd1['date'].max().date()  # Get the latest date in the 'date' column of Reddit data

# Print the date range for the Reddit data
print(f"Date Range for Reddit Data: {min_date} to {max_date}")

# Download stock data for AAPL (Apple) using yfinance, limited to Reddit's date range.
# Note: yfinance treats `end` as exclusive, so the last Reddit date itself is not downloaded.
stock_data = yf.download('AAPL', start=min_date, end=max_date)

# Reset the index so that 'Date' becomes a regular column (the columns remain a two-level MultiIndex)
stock_data.reset_index(inplace=True)

# Ensure the 'Date' column is in datetime format. This is done after resetting the index.
# We use .dt.date to convert the 'Date' column to a date object (removing time component).
stock_data['date'] = pd.to_datetime(stock_data['Date']).dt.date

# Print the date range for the stock data to ensure it matches the range from Reddit data
print(f"Date Range for Stock Data: {stock_data['Date'].min()} to {stock_data['Date'].max()}")

# Display the first few rows of the stock data
stock_data.head()

Date Range for Reddit Data: 2009-10-20 to 2024-12-05


[*********************100%***********************]  1 of 1 completed

Date Range for Stock Data: 2009-10-20 00:00:00+00:00 to 2024-12-04 00:00:00+00:00





Price,Date,Adj Close,Close,High,Low,Open,Volume,date
Ticker,Unnamed: 1_level_1,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,Unnamed: 8_level_1
0,2009-10-20 00:00:00+00:00,5.987978,7.098571,7.205357,7.066071,7.164286,1141039200,2009-10-20
1,2009-10-21 00:00:00+00:00,6.173559,7.318571,7.453929,7.115357,7.125714,1193726800,2009-10-21
2,2009-10-22 00:00:00+00:00,6.181998,7.328571,7.423214,7.2325,7.310714,791392000,2009-10-22
3,2009-10-23 00:00:00+00:00,6.144037,7.283571,7.35,7.258214,7.346429,420786800,2009-10-23
4,2009-10-26 00:00:00+00:00,6.10005,7.231429,7.383929,7.146429,7.273929,484338400,2009-10-26


In [38]:
# Flatten the MultiIndex columns by joining the levels with '_'
stock_data.columns = ['_'.join(col).strip() if col[1] else col[0] for col in stock_data.columns]
# View the flattened columns
stock_data.columns

Index(['Date', 'Adj Close_AAPL', 'Close_AAPL', 'High_AAPL', 'Low_AAPL',
       'Open_AAPL', 'Volume_AAPL', 'date'],
      dtype='object')
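
The flattening step can be tried without downloading anything by building a small frame with the same two-level column layout. The sketch below is illustrative only: the helper `flatten_columns` and the toy values are hypothetical. It skips empty level names rather than testing only the second level, and it leaves already-flat frames untouched, so re-running the cell is harmless.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the (Price, Ticker) column layout
columns = pd.MultiIndex.from_tuples(
    [("Date", ""), ("Close", "AAPL"), ("Volume", "AAPL"), ("date", "")],
    names=["Price", "Ticker"],
)
toy = pd.DataFrame([["2009-10-20", 7.10, 1141039200, "2009-10-20"]], columns=columns)


def flatten_columns(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Join MultiIndex column levels with sep, skipping empty level names."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        out.columns = [
            sep.join(str(part) for part in col if str(part) != "")
            for col in out.columns
        ]
    return out


flat = flatten_columns(toy)
print(list(flat.columns))
```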

In [40]:
stock_data['date'] = pd.to_datetime(stock_data['date']) # Convert the 'date' column to datetime format

In [41]:
stock_data['date'].dtype   # Identifying the datatype of date column

dtype('<M8[ns]')

In [42]:
# Inner join on the shared 'date' column of the scraped Reddit data (fd1) and stock_data
merged_data = pd.merge(fd1, stock_data, on='date', how='inner')
merged_data.head()

Unnamed: 0,subreddit,title,content,score,num_comments,url,created_utc,upvote_ratio,author,cleaned_title,...,content_sentiment_numeric,time_date,date,Date,Adj Close_AAPL,Close_AAPL,High_AAPL,Low_AAPL,Open_AAPL,Volume_AAPL
0,WallStreetBets,CLASS ACTION AGAINST ROBINHOOD. Allowing peopl...,LEAVE ROBINHOOD. They dont deserve to make mon...,228428,17828,https://www.reddit.com/r/wallstreetbets/commen...,2021-01-28 13:49:11,0.97,does-it-mater,class action robinhood allowing people sell de...,...,0,2021-01-28 13:49:11,2021-01-28,2021-01-28 00:00:00+00:00,134.054138,137.089996,141.990005,136.699997,139.520004,142621100
1,WallStreetBets,"Hey everyone, Its Mark Cuban. Jumping on to do...",Lets Go !,159755,26297,https://www.reddit.com/r/wallstreetbets/commen...,2021-02-02 14:35:23,0.91,mcuban,hey everyone mark cuban jumping ama ask anything,...,0,2021-02-02 14:35:23,2021-02-02,2021-02-02 00:00:00+00:00,132.000687,134.990005,136.309998,134.610001,135.729996,83305400
2,WallStreetBets,Like this post if you are holding!!💎 The real ...,"Buy more during dips if you can, but at least ...",139432,6798,https://www.reddit.com/r/wallstreetbets/commen...,2021-01-28 16:30:30,0.97,uwillmire,like post holding real squeeze yet happen,...,0,2021-01-28 16:30:30,2021-01-28,2021-01-28 00:00:00+00:00,134.054138,137.089996,141.990005,136.699997,139.520004,142621100
3,WallStreetBets,Where do we go from here and who is going to s...,We have grown to the kind of size we only drea...,120942,27100,https://www.reddit.com/r/wallstreetbets/commen...,2021-01-28 00:36:02,0.92,zjz,go going step help u,...,0,2021-01-28 00:36:02,2021-01-28,2021-01-28 00:00:00+00:00,134.054138,137.089996,141.990005,136.699997,139.520004,142621100
4,WallStreetBets,"Mark Cuban said ""once the brokerage stops rest...",Let's show them what we're made of: retards.,105158,5892,https://www.reddit.com/r/wallstreetbets/commen...,2021-02-05 03:32:31,0.88,crispizzle,mark cuban said brokerage stop restricting tra...,...,-1,2021-02-05 03:32:31,2021-02-05,2021-02-05 00:00:00+00:00,133.931274,136.759995,137.419998,135.860001,137.350006,75693800


In [43]:
merged_data.shape    #Identifying the shape of merged_data

(6825, 34)
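
An inner join silently drops posts made on days without stock data (weekends and market holidays). A possible way to see how many posts are lost, sketched on toy data, is to merge with `indicator=True` and to guard against duplicate dates on the stock side with `validate`. The frame names and prices here are made up for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): three posts, one of them on a Saturday
posts = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date": pd.to_datetime(["2021-01-28", "2021-01-28", "2021-01-30"]),
})
prices = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-28", "2021-01-29"]),
    "Close_AAPL": [100.0, 101.0],
})

# validate raises if a date appears more than once on the stock side;
# indicator adds a '_merge' column describing where each row came from
checked = pd.merge(
    posts, prices, on="date", how="left",
    validate="many_to_one", indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} posts have no stock data for their date")

# Keeping only matched rows reproduces the result of an inner join
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
```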

In [44]:
# Count NaN values in each column
nan_count_per_column = merged_data.isna().sum()

# Display the result
print("NaN values per column:")
nan_count_per_column

NaN values per column:


Unnamed: 0,0
subreddit,0
title,0
content,0
score,0
num_comments,0
url,0
created_utc,0
upvote_ratio,0
author,0
cleaned_title,0


In [45]:
# Create a new feature for the percentage change in closing price.
# Note: merged_data has one row per Reddit post (not one per trading day) and is not sorted by date,
# so pct_change(), rolling(), and shift() below compare consecutive rows, not consecutive trading days.
merged_data['Price_Change'] = merged_data['Close_AAPL'].pct_change()

# Calculate moving averages over 3-row and 7-row windows
merged_data['3d_MA'] = merged_data['Close_AAPL'].rolling(window=3).mean()
merged_data['7d_MA'] = merged_data['Close_AAPL'].rolling(window=7).mean()

# Lag feature (previous row's price change)
merged_data['Prev_Price_Change'] = merged_data['Price_Change'].shift(1)
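
Because the merged frame has one row per Reddit post, the calculations in the cell above run across posts rather than across trading days. If true daily features are intended, one option is to compute them on the one-row-per-day stock frame first and merge afterwards. The sketch below shows that ordering on toy data; it is an alternative under that assumption, not what this notebook does, and all values are made up.

```python
import pandas as pd

# Toy daily prices (hypothetical), one row per trading day, sorted by date
daily = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
    "Close_AAPL": [100.0, 110.0, 99.0, 99.0],
}).sort_values("date")

# Day-over-day features computed on the daily frame
daily["Price_Change"] = daily["Close_AAPL"].pct_change()
daily["3d_MA"] = daily["Close_AAPL"].rolling(window=3).mean()
daily["Prev_Price_Change"] = daily["Price_Change"].shift(1)

# Toy posts (hypothetical): two on the same day, one a day earlier
posts = pd.DataFrame({
    "title": ["p1", "p2", "p3"],
    "date": pd.to_datetime(["2021-01-06", "2021-01-06", "2021-01-05"]),
})

# Every post inherits the features of its own trading day
merged = posts.merge(daily, on="date", how="inner", validate="many_to_one")
print(merged[["title", "date", "Price_Change", "3d_MA"]])
```

With this ordering, posts made on the same day share identical price features, and the values do not depend on the order of the posts.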

In [46]:
import pandas as pd

# Count NaN values in each column
nan_count_per_column = merged_data.isna().sum()

# Display the result
print("NaN values per column:")
nan_count_per_column

NaN values per column:


Unnamed: 0,0
subreddit,0
title,0
content,0
score,0
num_comments,0
url,0
created_utc,0
upvote_ratio,0
author,0
cleaned_title,0


Some rows must be dropped because the pct_change, rolling, and shift calculations introduced NaN values.

In [47]:
# Drop rows with NaN values generated from rolling calculations
merged_data.dropna(inplace=True)
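
Calling `dropna()` without arguments removes a row if any column is missing, including columns unrelated to the rolling calculations. A more targeted variant, sketched below on toy data, restricts the check to the engineered columns with `subset`. The column list `engineered` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): NaN in an engineered column and in an unrelated one
df = pd.DataFrame({
    "author": ["a", None, "c"],
    "Price_Change": [np.nan, 0.01, -0.02],
    "7d_MA": [np.nan, 10.0, 11.0],
})

# Only these columns decide whether a row is dropped
engineered = ["Price_Change", "7d_MA"]

before = len(df)
cleaned = df.dropna(subset=engineered)
print(f"Dropped {before - len(cleaned)} of {before} rows")
```

Here the row with a missing author is kept, because only the engineered columns are checked.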

In [48]:
# Count NaN values in each column
nan_count_per_column = merged_data.isna().sum()

# Display the result
print("NaN values per column:")
nan_count_per_column

NaN values per column:


Unnamed: 0,0
subreddit,0
title,0
content,0
score,0
num_comments,0
url,0
created_utc,0
upvote_ratio,0
author,0
cleaned_title,0


In [49]:
merged_data.shape   # The shape of new feature extracted and merged data

(6819, 38)

The data now has 6819 rows and 38 columns, without any NaN values.

In [50]:
# Create 'stock_direction' column: 1 if Price_Change is positive, 0 otherwise (decrease or no change)
merged_data['stock_direction'] = (merged_data['Price_Change'] > 0).astype(int)

In [52]:
merged_data['stock_direction'].value_counts()    # Identifying the count of stock_direction in our dataset

Unnamed: 0_level_0,count
stock_direction,Unnamed: 1_level_1
0,3752
1,3067


3752 rows have stock_direction 0 (no increase) and 3067 rows have stock_direction 1 (increase).

In [53]:
reddit_stock_direction = merged_data

In [54]:
reddit_stock_direction.to_csv('reddit_stock_direction.csv', index=False)  # Save the data to a CSV file for model building

A model trained on imbalanced classes can be biased toward the majority class, so we create a balanced dataset by oversampling the minority class.

In [55]:
from sklearn.utils import resample

# Separate majority and minority classes
majority = merged_data[merged_data['stock_direction'] == 0]
minority = merged_data[merged_data['stock_direction'] == 1]

# Oversample minority class
minority_oversampled = resample(minority,
                                replace=True,     # Sample with replacement
                                n_samples=len(majority),  # Match majority class size
                                random_state=42)

# Combine majority class with oversampled minority class
balanced_data = pd.concat([majority, minority_oversampled])

# Shuffle data
balanced_data = balanced_data.sample(frac=1, random_state=42)

# Check class distribution
balanced_data['stock_direction'].value_counts()

Unnamed: 0_level_0,count
stock_direction,Unnamed: 1_level_1
0,3752
1,3752


Now the data is balanced
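
Oversampling with replacement duplicates rows, so if the balanced data is later split into training and test sets, copies of the same post can land in both. A common precaution is to hold out the test rows first and oversample only the training portion. The sketch below shows that ordering with pandas alone on toy data; the variable names and the 20% split are assumptions for illustration, not part of this notebook.

```python
import pandas as pd

# Toy data (hypothetical): 6 rows of class 0 and 4 rows of class 1
data = pd.DataFrame({
    "feature": range(10),
    "stock_direction": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Hold out the test rows before any resampling
test = data.sample(frac=0.2, random_state=42)
train = data.drop(test.index)

# Oversample every smaller class in the training set up to the largest class size
target = train["stock_direction"].value_counts().max()
parts = []
for label, group in train.groupby("stock_direction"):
    if len(group) < target:
        group = group.sample(n=target, replace=True, random_state=42)
    parts.append(group)

# Combine and shuffle the balanced training set
train_balanced = pd.concat(parts).sample(frac=1, random_state=42)
print(train_balanced["stock_direction"].value_counts())
```

The test set keeps its original class distribution, so evaluation reflects the data as it actually occurs.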

In [56]:
balanced_reddit_data = balanced_data
balanced_reddit_data.to_csv('balanced_reddit_data.csv', index=False)
# Store the balanced data for model building

In [57]:
balanced_reddit_data.columns    #These are columns of balanced_data

Index(['subreddit', 'title', 'content', 'score', 'num_comments', 'url',
       'created_utc', 'upvote_ratio', 'author', 'cleaned_title',
       'cleaned_content', 'title_sentiment', 'content_sentiment',
       'title_sentiment_class', 'content_sentiment_class', 'title_word_count',
       'content_word_count', 'title_char_count', 'content_char_count',
       'title_avg_word_length', 'content_avg_word_length',
       'title_sentiment_score', 'content_sentiment_score',
       'title_sentiment_numeric', 'content_sentiment_numeric', 'time_date',
       'date', 'Date', 'Adj Close_AAPL', 'Close_AAPL', 'High_AAPL', 'Low_AAPL',
       'Open_AAPL', 'Volume_AAPL', 'Price_Change', '3d_MA', '7d_MA',
       'Prev_Price_Change', 'stock_direction'],
      dtype='object')

In [58]:
balanced_reddit_data.head(1)    # returning the top row of balanced_data dataframe

Unnamed: 0,subreddit,title,content,score,num_comments,url,created_utc,upvote_ratio,author,cleaned_title,...,Close_AAPL,High_AAPL,Low_AAPL,Open_AAPL,Volume_AAPL,Price_Change,3d_MA,7d_MA,Prev_Price_Change,stock_direction
4223,financialindependence,"28M, single, cross $100k net worth this year. ...","*crossed, I proofread my entire post but forgo...",5156,462,https://www.reddit.com/r/financialindependence...,2017-10-04 00:34:19,0.84,FIthrow222,single cross k net worth year college degree n...,...,38.369999,38.465,38.115002,38.407501,80655200,-0.268969,48.057499,69.900357,-0.015521,0


The feature engineering step is now complete. We prepared a well-structured dataset with the required columns. The next step is model building, which depends on this dataset.