### GROUP MEMBERS:
**1. Van Le:** vanle@buffalo.edu 

**2. Maria Anthony:** mariniv@buffalo.edu - CSE 587

**3. Anushka:** atiwari4@buffalo.edu - CSE 587

# PHASE II  -  MODEL DEVELOPMENT

#### Import necessary libraries

In [None]:
# !pip install pmdarima
# !pip install xgboost

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from statsmodels.tsa.stattools import adfuller 
from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
import time
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR, LinearSVR
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error, confusion_matrix
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.model_selection import GridSearchCV
from pandas.plotting import lag_plot
 
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import LSTM, Dense, Dropout


#### Read the csv file we have from the Phase I

In [None]:
merged_df = pd.read_csv("stock-feature.csv")
merged_df.head()

In [None]:
merged_df['Date'] = pd.to_datetime(merged_df['Date'])

# Keep only the rows dated 2017-01-01 or later
start_date = pd.to_datetime('2017-01-01')
end_date = merged_df['Date'].max()
filtered_df = merged_df[(merged_df['Date'] >= start_date) & (merged_df['Date'] <= end_date)]
filtered_df = filtered_df.reset_index(drop=False)
filtered_df = filtered_df.drop(['level_0', 'index'], axis=1)
filtered_df

In [None]:
list(filtered_df.columns)

Our dataset contains over 4 million rows, each with 30 features covering historical stock prices and technical indicators. We therefore apply data scaling extensively and examine how different scaling techniques affect dimensionality reduction via PCA. The PCA-transformed data will be supplied as input features to several stock price prediction models; integrating dimensionality reduction may improve the accuracy of the forecasting models and yield more accurate predictions of future stock prices [2,3].

### Comparison of Scaling Techniques on PCA-based Model

In [None]:
columns_for_pca = ['Open','High','Low','Volume','Market Cap','% Change',
                   'SMA','EMA','ADX','MACD','MACD Signal Line','RSI',
                   'Bollinger_Bands_Middle','Bollinger_Bands_Upper','Bollinger_Bands_Lower',
                   'KAMA','MFI','Tenkan_Sen','Kijun_Sen','Senkou_SpanA','Senkou_SpanB','Momentum',
                   '%K','%D','Chaikin_AD','ROC','ATR','Normalized_ATR','OBV']

data_for_pca = filtered_df[columns_for_pca]

In [None]:
# Create histograms for each feature
data_for_pca.hist(figsize=(15, 12), bins=20)
plt.subplots_adjust(hspace=0.5)
plt.suptitle('Figure 1. Distribution of each feature', fontsize=16, color='b')
plt.show()

**Observations:**
From Figure 1 we observe:
- Skewed distributions: Certain features such as `% Change`, `MACD Signal Line`, `Momentum`, and `Chaikin_AD` are quite skewed. Skewed distributions are often better handled by scalers that are robust to outliers, so we consider **Robust Scaler**.
- Bimodal distributions: Features like `Volume`, `MACD`, and `OBV` show two peaks, which suggests that the choice of scaling approach may affect these features significantly.
- Features with outliers: `Open`, `High`, `Low`, `Market Cap`, and `ATR` contain outliers. To diminish their influence, **Robust Scaler** is a natural candidate; **Min-Max Scaler** is also compared, although it maps the extreme values to the ends of its range and so remains sensitive to them.
- Normal-like distributions: `KAMA`, `MFI`, and the `Bollinger Bands` appear close to normally distributed, so these indicators could be scaled effectively with the **Standard Scaler**.
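To make the outlier behaviour mentioned in these observations concrete, the following is a minimal sketch on a hypothetical toy feature (nine ordinary values and one extreme value, not taken from the stock data). It compares how much spread each scaler leaves among the ordinary values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical toy feature: nine ordinary values plus one extreme outlier.
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

toy_scaled = {
    "StandardScaler": StandardScaler().fit_transform(toy).ravel(),
    "MinMaxScaler": MinMaxScaler().fit_transform(toy).ravel(),
    "RobustScaler": RobustScaler().fit_transform(toy).ravel(),
}

# Spread of the nine ordinary values after scaling; the outlier is the last entry.
toy_spread = {}
for scaler_name, values in toy_scaled.items():
    inliers = values[:-1]
    toy_spread[scaler_name] = inliers.max() - inliers.min()
    print(f"{scaler_name:15s} inlier spread = {toy_spread[scaler_name]:.4f}, "
          f"outlier = {values[-1]:.2f}")
```

In this toy case the min-max and standard scalers squeeze the ordinary values into a very narrow band because the single outlier defines the range or inflates the standard deviation, while the robust scaler, which uses the median and interquartile range, keeps them spread out.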

In [None]:
pca_table = pd.DataFrame(index=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])

# Define different Scaler methods
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

pca_results = {}

# Applying each method and perform PCA
for name, scaler in scalers.items():
    scaled_data = pd.DataFrame(scaler.fit_transform(data_for_pca))
    num_components = 5
    pca = PCA(n_components=num_components)
    pca_fit = pca.fit_transform(scaled_data)

    # Store the results in a dictionary
    pca_results[name] = {
        'scores': pca_fit, 
        'components': pca.components_,  
        'explained_variance_ratio': pca.explained_variance_ratio_ 
    }
    
    pca_table[name] = pca.explained_variance_ratio_

In [None]:
def biplot(score, coeff, labels=None, name=''):
    plt.figure(figsize=(8,4))  
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scale_x = 1.0/(xs.max() - xs.min())
    scale_y = 1.0/(ys.max() - ys.min())
    
    plt.scatter(xs * scale_x,ys * scale_y)
    
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color='r',alpha=0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color='g', ha='center', va='center')
    
    plt.xlabel(f"PC{1}")
    plt.ylabel(f"PC{2}")
    plt.title(f'Biplot using {name}', fontsize=14, color='b')
    plt.grid()
    plt.show()  

In [None]:
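# Plot the biplot for each scaler method (Figure 2).
# biplot() calls plt.show() itself, so the figure label is passed into each title;
# a plt.suptitle() placed after the loop would attach to a new, empty figure.
for name, result in pca_results.items():
    biplot(result['scores'][:,0:2], np.transpose(result['components'][0:2, :]), name=f'{name} (Figure 2)')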

In [None]:
print(pca_table)

**Observation - From biplots in Figure 2:**

- StandardScaler: The plot shows a reasonable distribution of data points and component vectors (arrows), indicating a balanced representation of the dataset. The component vectors are well spread out and, judging by `explained_variance_ratio_`, no single component overwhelmingly dominates. This suggests that the standardized data has its variance captured effectively across multiple dimensions.
- MinMaxScaler: The data points in this plot are compressed into a very narrow range on the x-axis. This is likely because MinMaxScaler is sensitive to outliers: when the data contains outliers, they cause a compression effect, which may not be ideal for PCA.
- RobustScaler: The last plot shows a distribution similar to that of StandardScaler, but with less extreme variation on the axes. The component vectors are also well distributed. RobustScaler is less sensitive to outliers, which suggests it captures the intrinsic spread of the data more effectively without allowing outliers to dictate the scale. Additionally, the first principal component `PC1` accounts for approximately 80% of the variance, which is much higher than with the other two methods. The remaining components contribute significantly less, which indicates that after mitigating the influence of outliers, the first component becomes highly dominant in explaining the variance.

We move forward with `RobustScaler`, since it appears to be the most suitable scaler for this PCA analysis: it handles outliers better than the other two scalers, so the principal components are not skewed by extreme values. The distribution of points also suggests that the inherent structure of the data is maintained.
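A cumulative view of the explained variance can help judge how many components are worth keeping. The sketch below uses hypothetical synthetic data (two latent factors mixed into eight features) as a stand-in for `data_for_pca`; the `demo_` names are illustrative only and are not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for data_for_pca: 500 rows, 8 features driven by 2 latent factors.
demo_latent = rng.normal(size=(500, 2))
demo_mixing = rng.normal(size=(2, 8))
demo_features = demo_latent @ demo_mixing + 0.1 * rng.normal(size=(500, 8))

# Same pipeline as in the notebook: robust scaling followed by a 5-component PCA.
demo_scaled = RobustScaler().fit_transform(demo_features)
demo_pca = PCA(n_components=5).fit(demo_scaled)

# Running total of the variance share explained by the first k components.
demo_cumulative = np.cumsum(demo_pca.explained_variance_ratio_)
for k, share in enumerate(demo_cumulative, start=1):
    print(f"PC1..PC{k}: {share:.3f} of total variance")
```

Applied to the real data, the same cumulative sum over `pca.explained_variance_ratio_` would show how much of the variance the five retained components cover.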

In [None]:
# Perform the Robust Scaler transformation
transformer = RobustScaler()
scaled_arr = transformer.fit_transform(data_for_pca)
scaled_df = pd.DataFrame(scaled_arr, columns=data_for_pca.columns)

# Proceed with the PCA transformation
num_components = 5
pca = PCA(n_components=num_components)
pca_result = pca.fit_transform(scaled_df)
pca_result_df = pd.DataFrame(pca_result, columns=['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5'])
pca_result_df.head()

In [None]:
# Heatmap of Component Loadings
plt.figure(figsize=(8,8))
sns.heatmap(pca.components_,
            cmap='magma',
            yticklabels=["PC"+str(x) for x in range(1,len(pca.components_)+1)],
            xticklabels=columns_for_pca,
            cbar_kws={"orientation": 'vertical'})
plt.tight_layout()
plt.title('Figure 3. Heatmap of Component Loadings', size=16)
plt.show()

In [None]:
target = filtered_df[['Date', 'Ticker', 'Close']]
target.index = pca_result_df.index
transformed_df = pd.concat([pca_result_df, target], axis=1)
transformed_df

In [None]:
import itertools

fig, axes = plt.subplots(num_components, num_components, figsize=(12, 12))
for i, j in itertools.product(range(num_components), range(num_components)):
    ax = axes[i, j]
    if i != j:
        # Plot the PCA scores (not the scaled input features) so the axes match their PC labels
        ax.scatter(pca_result[:, i], pca_result[:, j])
        ax.set_xlabel(f'PC{i+1}')
        ax.set_ylabel(f'PC{j+1}')
    else:
        ax.annotate(f'PC{i+1}', (0.5, 0.5), xycoords='axes fraction', ha='center', va='center')
        ax.axis('off')

plt.tight_layout()
plt.suptitle("Figure 4. Comparison of each Principal Component", fontsize=14, color='r', y=1.05)
plt.show()

In [None]:
transformed_df.to_csv('//Users//vanle//Downloads//pca_stock_result.csv', index=False)

In [None]:
transformed_df.isnull().sum()

#### Read the CSV file processed in the previous step

In [2]:
stock_df = pd.read_csv('pca_stock_result.csv')
stock_df['Date'] = pd.to_datetime(stock_df['Date'])
stock_df.set_index('Date', inplace=True)
stock_df

| Date | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Ticker | Close |
|---|---|---|---|---|---|---|---|
| 2017-01-03 | -4.694621 | 0.050102 | 0.525245 | -4.551044 | 1.058598 | A | 46.178 |
| 2017-01-04 | -4.948428 | 0.293337 | 0.468583 | -4.553871 | 1.065424 | A | 46.784 |
| 2017-01-05 | -4.885045 | 0.113366 | 0.636895 | -4.541259 | 1.020767 | A | 46.228 |
| 2017-01-06 | -5.465603 | 0.731387 | 0.340322 | -4.613235 | 1.154764 | A | 47.668 |
| 2017-01-09 | -5.977897 | 1.172599 | 0.107106 | -4.593020 | 1.132056 | A | 47.817 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-11-06 | -6.903199 | -1.760332 | -1.418393 | 4.088618 | 0.799909 | ZYNE | 11.190 |
| 2017-11-07 | -6.796380 | -1.978180 | -1.165790 | 4.113964 | 0.734404 | ZYNE | 10.830 |
| 2017-11-08 | -6.773013 | -2.083958 | -1.029962 | 4.123684 | 0.727150 | ZYNE | 10.900 |
| 2017-11-09 | -7.115683 | -1.690110 | -1.202882 | 4.137377 | 0.783180 | ZYNE | 11.600 |


## RobustScaler - PCA-based Stock Price Prediction Models Development 

#### Splitting data into training and testing set

In [3]:
stock_df.sort_values(['Ticker', 'Date'], inplace=True)
stock_df['row_num'] = stock_df.groupby('Ticker').cumcount() + 1

# Calculating total count per partition
partition_counts = stock_df.groupby('Ticker').size().reset_index(name='partition_count')

# Merging the counts with the DataFrame
stock_df_with_counts = pd.merge(stock_df, partition_counts, on='Ticker', how='left')

# Calculate the thresholds for train, test, and valid partitions
stock_df_with_counts['partition_threshold'] = stock_df_with_counts['partition_count'] * 0.7
stock_df_with_counts['test_threshold'] = stock_df_with_counts['partition_count'] * 0.85

# Filter the DataFrame per partition based on the thresholds
train_data = stock_df_with_counts[stock_df_with_counts['row_num'] <= stock_df_with_counts['partition_threshold']]
test_data = stock_df_with_counts[(stock_df_with_counts['row_num'] > stock_df_with_counts['partition_threshold']) &
                                (stock_df_with_counts['row_num'] <= stock_df_with_counts['test_threshold'])]
valid_data = stock_df_with_counts[stock_df_with_counts['row_num'] > stock_df_with_counts['test_threshold']]

# Lower all column names 
train_data.columns = map(str.lower, train_data.columns)
test_data.columns = map(str.lower, test_data.columns)
valid_data.columns = map(str.lower, valid_data.columns)

# Drop all unnecessary columns
train_data = train_data.drop(columns=['row_num', 'partition_count', 'partition_threshold', 'test_threshold'], axis=1)
test_data = test_data.drop(columns=['row_num', 'partition_count', 'partition_threshold', 'test_threshold'], axis=1)
valid_data = valid_data.drop(columns=['row_num', 'partition_count', 'partition_threshold', 'test_threshold'], axis=1)

# Split into X and y 
X_train = train_data.drop('close', axis=1)
X_test = test_data.drop('close', axis=1)
X_valid = valid_data.drop('close', axis=1)
y_train = train_data[['close']]
y_test = test_data[['close']]
y_valid = valid_data[['close']]
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of X_valid:", X_valid.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
print("Shape of y_valid:", y_valid.shape)

Shape of X_train: (269495, 6)
Shape of X_test: (58470, 6)
Shape of X_valid: (58601, 6)
Shape of y_train: (269495, 1)
Shape of y_test: (58470, 1)
Shape of y_valid: (58601, 1)
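The per-ticker chronological split used above can be checked on a small example. The following is a minimal sketch on a hypothetical toy frame (two made-up tickers with 15 and 9 rows); the helper name `chronological_split` is illustrative and does not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data: two tickers with 15 and 9 rows, already sorted by date.
toy_df = pd.DataFrame({
    "Ticker": ["AAA"] * 15 + ["BBB"] * 9,
    "Close": range(24),
})

def chronological_split(df, group_col="Ticker", train_frac=0.7, test_frac=0.85):
    """Split each group's rows by position into train / test / valid parts."""
    # 1-based position of each row inside its group
    row_num = df.groupby(group_col).cumcount() + 1
    # Number of rows in the group each row belongs to
    group_size = df[group_col].map(df[group_col].value_counts())
    train = df[row_num <= group_size * train_frac]
    test = df[(row_num > group_size * train_frac) & (row_num <= group_size * test_frac)]
    valid = df[row_num > group_size * test_frac]
    return train, test, valid

toy_train, toy_test, toy_valid = chronological_split(toy_df)
print(len(toy_train), len(toy_test), len(toy_valid))
```

Because the thresholds are applied within each ticker, the earliest 70% of every ticker's rows go to training, the next 15% to testing, and the remainder to validation, so no split mixes later rows of a ticker with earlier ones.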


In [4]:
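# Perform label encoding.
# Fit the encoder once on the tickers of all three splits so that each ticker maps
# to the same integer code everywhere; calling fit_transform on each split separately
# can assign different codes to the same ticker when the splits contain different tickers.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([X_train['ticker'], X_test['ticker'], X_valid['ticker']]))
train_ticker_encoder = pd.DataFrame(label_encoder.transform(X_train['ticker']), columns=['ticker'])
test_ticker_encoder = pd.DataFrame(label_encoder.transform(X_test['ticker']), columns=['ticker'])
valid_ticker_encoder = pd.DataFrame(label_encoder.transform(X_valid['ticker']), columns=['ticker'])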

# Drop the original 'ticker' column
X_train.drop(['ticker'],axis=1,inplace=True)
X_test.drop(['ticker'],axis=1,inplace=True)
X_valid.drop(['ticker'],axis=1,inplace=True)
print("X_train columns check:", X_train.columns)
print("X_test columns check:", X_test.columns)
print("X_valid columns check:", X_valid.columns)

# Reset the index of the training and testing data
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
X_valid = X_valid.reset_index(drop=True)
train_ticker_encoder = train_ticker_encoder.reset_index(drop=True)
test_ticker_encoder = test_ticker_encoder.reset_index(drop=True)
valid_ticker_encoder = valid_ticker_encoder.reset_index(drop=True)

# Concatenate the encoded 'ticker' with the original DataFrames
X_train = pd.concat([X_train, train_ticker_encoder], axis=1)
X_test = pd.concat([X_test, test_ticker_encoder], axis=1)
X_valid = pd.concat([X_valid, valid_ticker_encoder], axis=1)

print(f"Sum of null value if exist in X_train:", X_train.isnull().sum())
print(f"Sum of null value if exist in X_test:", X_test.isnull().sum())
print(f"Sum of null value if exist in X_valid:", X_valid.isnull().sum())
print(f"Shape of training features:", X_train.shape)
print(f"Shape of test features:", X_test.shape)
print(f"Shape of valid features:", X_valid.shape)
print(f"Shape of training target:", y_train.shape)
print(f"Shape of test target:", y_test.shape)
print(f"Shape of valid target:", y_valid.shape)
print(f"Train set: ")
print(X_train.tail())
print(f"Test set:" )
print(X_test.tail())
print(f"Target:" )
print(y_test.tail())

X_train columns check: Index(['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5'], dtype='object')
X_test columns check: Index(['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5'], dtype='object')
X_valid columns check: Index(['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5'], dtype='object')
Sum of null value if exist in X_train: feature_1    0
feature_2    0
feature_3    0
feature_4    0
feature_5    0
ticker       0
dtype: int64
Sum of null value if exist in X_test: feature_1    0
feature_2    0
feature_3    0
feature_4    0
feature_5    0
ticker       0
dtype: int64
Sum of null value if exist in X_valid: feature_1    0
feature_2    0
feature_3    0
feature_4    0
feature_5    0
ticker       0
dtype: int64
Shape of training features: (269495, 6)
Shape of test features: (58470, 6)
Shape of valid features: (58601, 6)
Shape of training target: (269495, 1)
Shape of test target: (58470, 1)
Shape of valid target: (58601, 1)
Train set: 
        feat
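One point worth checking when label-encoding a column across several splits is whether the same ticker receives the same integer in every split. A minimal sketch with hypothetical tickers (not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical ticker columns; 'AAA' is absent from the test split.
train_tickers = np.array(["AAA", "BBB", "CCC"])
test_tickers = np.array(["BBB", "CCC"])

# Refitting per split: 'BBB' is encoded as 1 in train but as 0 in test.
refit_train = LabelEncoder().fit_transform(train_tickers)
refit_test = LabelEncoder().fit_transform(test_tickers)

# Fitting once on all tickers keeps the mapping identical across splits.
shared_encoder = LabelEncoder().fit(np.concatenate([train_tickers, test_tickers]))
shared_train = shared_encoder.transform(train_tickers)
shared_test = shared_encoder.transform(test_tickers)

print("refit :", refit_train, refit_test)
print("shared:", shared_train, shared_test)
```

With a shared encoder the integer code carries the same meaning for the model at training and prediction time.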

In [5]:
# Convert to a one-dimensional NumPy array
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
y_valid = y_valid.values.ravel()
print(type(y_train))
print(y_train.shape)
print(type(y_test))
print(y_test.shape)
print(type(y_valid))
print(y_valid.shape)

<class 'numpy.ndarray'>
(269495,)
<class 'numpy.ndarray'>
(58470,)
<class 'numpy.ndarray'>
(58601,)


### 1. ARIMA Model

#### Test for stationarity:
A time series is stationary if it has no trend or seasonal effects. If the data is non-stationary, we have to convert it to stationary data before fitting the ARIMA model. To check whether the data is stationary, we use the Augmented Dickey-Fuller (ADF) test.

The ADF test, a type of “unit root test”, is a statistical test that indicates how strongly a null hypothesis can be rejected. A p-value below a chosen threshold (1%, 5%, or 10%) suggests we can reject the null hypothesis [2].

- Null Hypothesis ($H_0$): the time series has a unit root, meaning it is non-stationary. Failing to reject $H_0$ suggests non-stationarity.
- Alternative Hypothesis ($H_1$): the time series does not have a unit root, meaning it is stationary. Rejecting $H_0$ supports this.
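For reference, the hypotheses above can be written in terms of the regression that the ADF test estimates. This is the general textbook form; the symbols $\alpha$, $\beta$, $\gamma$, $\delta_i$, and the lag count $p$ are notation introduced here, not taken from the notebook, and the trend term $\beta t$ is dropped when only a constant is included.

$$
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \quad \text{vs.} \quad H_1:\ \gamma < 0
$$

A strongly negative test statistic for $\gamma$, below the critical values, is evidence against the unit root.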


In [None]:
# Checking the stationarity
def adf_test(timeseries):
    # # Determining rolling statistics
    # moving_average = timeseries.rolling(12).mean()
    # moving_std = timeseries.rolling(12).std()
    # Perform Dickey-Fuller test:
    adft = adfuller(timeseries, autolag='AIC')
    # Extract and display test results in a Series
    output = pd.Series(adft[0:4], 
                       index=['Test Statistics', 'p-value', 'No. of lags used', 'Number of observations used'])
    
    for key, value in adft[4].items():
        output["Critical value (%s)" % key] = value
    print(output)

#  Check if y_train is stationary 
adf_test(y_train)

After performing the Augmented Dickey-Fuller (ADF) test, we obtained a test statistic of $-42.395700$, which is much lower than the critical values at the 1%, 5%, and 10% levels. The p-value is essentially 0, which is below the common alpha level of 0.05. This strongly suggests that we can reject the null hypothesis and conclude that our time series data is stationary.

We will apply `auto_arima` function from the `pmdarima` library since it is a useful tool for automatically determining the optimal parameters for an ARIMA model based on the provided time series data. Then, we will use the returned parameters to fit our ARIMA model.

#### Fit the Auto-ARIMA Model

In [None]:
# Fit the auto_arima model
def fit_auto_arima(train_data, initial_p, initial_q):
    stepwise_fit = auto_arima(train_data, 
                              start_p=1, start_q=1,
                              max_p=initial_p, max_q=initial_q, 
                              m=1, # we want to imply non-seasonal, so we use m=1 
                              start_P=0, 
                              seasonal=False,
                              d=None, 
                              D=0, 
                              trace=True,
                              error_action='ignore',
                              suppress_warnings=True,
                              stepwise=True)
    return stepwise_fit

In [None]:
auto_arima_model = fit_auto_arima(y_train,5,5)
print(auto_arima_model.summary())

In [None]:
auto_arima_model.plot_diagnostics(figsize=(14,10))
plt.show()

ARIMA is a class of models used for time series forecasting and analysis. It combines autoregressive (AR), integrated (I, i.e. differencing), and moving average (MA) components to model and predict time series data. For the ARIMA(5,1,1) model selected above, the three numbers mean:

**5**: This is the number of autoregressive (AR) terms in the model. The autoregressive component captures the relationship between the current value in the time series and its past values. In this case, there are 5 lagged (previous) values of the time series that will be used as predictors in the model.

**1**: This is the differencing order (I) in the model. The differencing component is used to make the time series stationary by taking differences between consecutive observations. A value of 1 indicates that first-order differencing is applied, which means that each value in the time series is replaced by the difference between it and the previous value.

**1**: This is the number of moving average (MA) terms in the model. The moving average component models the relationship between the current value and past white noise (random) error terms. In this case, there is 1 lagged white noise error term included in the model.

In summary, ARIMA(5,1,1) has 5 autoregressive terms, 1 order of differencing, and 1 moving average term, and all of its coefficients are statistically significant.
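Written out, the three components combine as follows. This is the standard form of an ARIMA(5,1,1) model without a constant term; the symbols $w_t$, $\phi_i$, $\theta_1$, and $\varepsilon_t$ are notation introduced here for illustration.

$$
w_t = y_t - y_{t-1}, \qquad
w_t = \sum_{i=1}^{5} \phi_i\, w_{t-i} + \varepsilon_t + \theta_1\, \varepsilon_{t-1}
$$

Here $w_t$ is the first-differenced series (the "1" in the middle), the $\phi_i$ are the five autoregressive coefficients, and $\theta_1$ is the single moving average coefficient applied to the previous white noise error.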

<!-- There are signs of non-normality and heteroskedasticity in the residuals according to the Jarque-Bera and heteroskedasticity tests -->

In [None]:
def fit_arima(train_data, p, d, q):
    arima_model = ARIMA(train_data, order=(p, d, q))
    arima_result = arima_model.fit()
    return arima_result

# Build and fit ARIMA model with order (p,d,q)=(5,1,1)
arima_result_1 = fit_arima(y_train, 5, 1, 1)
print(arima_result_1.summary())

# Forecasting stock prices on the test 
arima_forecast_1 = arima_result_1.get_forecast(steps=len(y_test))
arima_forecast_values_1 = arima_forecast_1.predicted_mean  
conf_int_1 = arima_forecast_1.conf_int()

# Plot forecast results against actual data 
plt.figure(figsize=(7, 5))
plt.plot(y_test, label='Actual Test Data')  # Plot y_test directly
plt.plot(arima_forecast_values_1, label='ARIMA(5,1,1) Forecast')  # plot against time step, not against y_test
plt.fill_between(
    range(len(y_test)),
    conf_int_1[:, 0],  
    conf_int_1[:, 1],  
    color='pink',
    alpha=0.3,
    label='95% Confidence Interval'
)
plt.title("Figure 5. ARIMA(5,1,1) Forecast vs Actual Test Data", fontsize=14, color='r')
plt.xlabel("Time Step")
plt.ylabel("Values")
plt.legend(loc='upper left', fontsize=10)
plt.show()

In [None]:
# The dataset is very large, so the kernel was interrupted in the middle of training
auto_arima_model = fit_auto_arima(y_train,6,6)
print(auto_arima_model.summary())

Even though the kernel was interrupted, the trace shows that ARIMA(4,1,3) has a smaller AIC than any other parameter combination tried. We fit it on the training data to see whether this model makes any significant improvement in forecasting.

In [None]:
# Fit the ARIMA model with order (p,d,q)=(4,1,3)
arima_result_2 = fit_arima(y_train, 4, 1, 3)
print(arima_result_2.summary())

# Make forecasts
arima_forecast_2 = arima_result_2.get_forecast(steps=len(y_test))
arima_forecast_values_2 = arima_forecast_2.predicted_mean  
conf_int_2 = arima_forecast_2.conf_int()

# Plot predict results against the actual results
plt.figure(figsize=(7, 5))
plt.plot(y_test, label='Actual Test Data')  # Plot y_test directly
plt.plot(arima_forecast_values_2, label='ARIMA(4,1,3) Forecast')  # plot against time step, not against y_test
plt.fill_between(
    range(len(y_test)),
    conf_int_2[:, 0],  
    conf_int_2[:, 1],  
    color='pink',
    alpha=0.3,
    label='95% Confidence Interval'
)
plt.title("Figure 6. ARIMA(4,1,3) Forecast vs Actual Test Data", fontsize=14, color='r')
plt.xlabel("Time Step")
plt.ylabel("Values")
plt.legend(loc='upper left', fontsize=10)
plt.show()

The best ARIMA model according to the Akaike Information Criterion (AIC) is an ARIMA(4,1,3) without a seasonal component. This model includes 4 autoregressive terms, a differencing order of 1, and 3 moving average terms.

- `AIC`: The AIC value is 3270235.960. The lower the AIC, the better the model fits the time series data while penalizing complexity.
- `Coefficients`: All the coefficients for the autoregressive (AR) and moving average (MA) terms are significant ($p < 0.05$), as indicated by the z-test.
- `Log Likelihood`: The log likelihood is a large negative number. It measures how probable the observed data is under the model and is mainly useful for comparing models fitted to the same data.
- `Ljung-Box Test`: The Ljung-Box test on residuals is a way to check for lack of fit. In this case, with a p-value of $0.61$, there is no evidence of lack of fit.
- `Jarque-Bera (JB) Test`: The JB statistic is extremely large, indicating that the residuals are not normally distributed. However, for large sample sizes, this test almost always indicates non-normality.
- `Heteroskedasticity (H)`: The test statistic is very high, indicating that there is heteroskedasticity in the residuals.
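The trade-off between fit and complexity in the AIC can be illustrated with a short calculation. The sketch below uses the standard definitions of AIC and BIC; the log-likelihood values and observation count are hypothetical and are not the ones reported in the model summaries, and the parameter counts (7 and 8) assume one variance parameter on top of the AR and MA coefficients.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical log-likelihoods for two candidate models.
# Assumed parameter counts: ARIMA(5,1,1) -> 5 AR + 1 MA + 1 variance = 7,
#                           ARIMA(4,1,3) -> 4 AR + 3 MA + 1 variance = 8.
aic_simpler = aic(log_likelihood=-1000.0, n_params=7)
aic_richer = aic(log_likelihood=-998.0, n_params=8)
bic_simpler = bic(log_likelihood=-1000.0, n_params=7, n_obs=500)
bic_richer = bic(log_likelihood=-998.0, n_params=8, n_obs=500)

print("AIC:", aic_simpler, aic_richer)
print("BIC:", round(bic_simpler, 2), round(bic_richer, 2))
```

In this made-up case the extra parameter improves the log likelihood by enough to lower the AIC, but the BIC, which penalizes parameters more heavily for large samples, prefers the simpler model.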

**Comparison between ARIMA(5,1,1) and ARIMA(4,1,3)**

- Based on these result tables, both models are relatively close in terms of fit, with ARIMA(4,1,3) having a slightly better log likelihood, AIC, and BIC compared to ARIMA(5,1,1). However, the differences are not significant, and the choice between these models may depend on other factors, such as the interpretability of the model coefficients.

- In ARIMA(5,1,1), the AR coefficients (ar.L1 to ar.L5) are all negative, indicating a decreasing trend in autocorrelation with lag. In ARIMA(4,1,3), the AR coefficients (ar.L1 to ar.L4) are a mix of positive and negative values, suggesting a more complex relationship.

- The standard errors in both models are relatively small, indicating that the parameter estimates are precise.
- The z-scores in ARIMA(4,1,3) are very large in absolute terms, and even larger than those in ARIMA(5,1,1), indicating that the coefficients are highly significant.

Overall, ARIMA(4,1,3) may fit the data slightly better. However, ARIMA(5,1,1) is a simpler model, so it might generalize more easily to a highly varied dataset. We evaluate the metrics below to get more insight into the results of both forecast models.

In [None]:
def forecast_bias(y_true, y_pred):
    return np.mean(y_pred - y_true)

In [None]:
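# NOTE: this definition replaces sklearn's mean_absolute_percentage_error imported above.
# It returns a percentage (0-100), whereas sklearn's version returns a fraction (0-1).
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100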

In [None]:
# Evaluate model performance: MAE, RMSE, MAPE
mae1 = mean_absolute_error(y_test, arima_forecast_values_1)
mae2 = mean_absolute_error(y_test, arima_forecast_values_2)

rmse1 = np.sqrt(mean_squared_error(y_test, arima_forecast_values_1))
rmse2 = np.sqrt(mean_squared_error(y_test, arima_forecast_values_2))

mape1 = mean_absolute_percentage_error(y_test, arima_forecast_values_1)
mape2 = mean_absolute_percentage_error(y_test, arima_forecast_values_2)

bias1 = forecast_bias(y_test, arima_forecast_values_1)
bias2 = forecast_bias(y_test, arima_forecast_values_2)

# Calculate forecast accuracy
accuracy1 = 1 - (mape1 / 100)
accuracy2 = 1 - (mape2 / 100)

# Create a comparison dataframe
models = ['ARIMA(4,1,3)', 'ARIMA(5,1,1)']
mae = [mae2, mae1]
rmse = [rmse2, rmse1]
mape = [mape2, mape1]
bias = [bias2, bias1]
accuracy = [accuracy2, accuracy1]

metrics_dict = {
    "Model": models,
    "MAE": mae,
    "RMSE": rmse,
    "MAPE": mape,
    "Forecast Bias": bias,
    "Forecast Accuracy": accuracy
}

metrics_df = pd.DataFrame(metrics_dict)
metrics_df
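To make the metric definitions above easier to verify, here is a small worked example on hypothetical prices chosen so the arithmetic can be checked by hand. The `toy_` names are illustrative only.

```python
import numpy as np

# Hypothetical actual and predicted prices.
toy_actual = np.array([100.0, 200.0, 400.0])
toy_predicted = np.array([110.0, 190.0, 400.0])

toy_errors = toy_predicted - toy_actual                    # [10, -10, 0]
toy_mae = np.mean(np.abs(toy_errors))                      # mean absolute error
toy_rmse = np.sqrt(np.mean(toy_errors ** 2))               # root mean squared error
toy_mape = np.mean(np.abs(toy_errors / toy_actual)) * 100  # in percent
toy_bias = np.mean(toy_errors)                             # positive = over-forecasting
toy_accuracy = 1 - toy_mape / 100

print(f"MAE={toy_mae:.3f} RMSE={toy_rmse:.3f} MAPE={toy_mape:.2f}% "
      f"bias={toy_bias:.3f} accuracy={toy_accuracy:.3f}")
```

Note that the forecast bias is zero here even though two of the three predictions are off: over- and under-forecasts cancel in the bias, which is why it is reported alongside MAE and RMSE rather than instead of them.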

In [None]:
# Get the last 50 points of y_test and the corresponding forecasted values and confidence interval
y_test_last_50 = y_test[-50:]
forecast_values_last_50 = arima_forecast_values_1[-50:]
conf_int_last_50 = conf_int_1[-50:]

# Create a range of values for the x-axis
x_axis_last_50 = range(len(y_test_last_50))

# Plot forecast results against actual data for the last 50 points
plt.figure(figsize=(12, 6))
plt.plot(x_axis_last_50, y_test_last_50, label='Actual Test Data', color='blue', linewidth=2)
plt.plot(x_axis_last_50, forecast_values_last_50, label='ARIMA(5,1,1) Forecast', color='red', linewidth=2)
plt.fill_between(x_axis_last_50, 
                 conf_int_last_50[:, 0], 
                 conf_int_last_50[:, 1], 
                 color='pink', 
                 alpha=0.3, 
                 label='95% Confidence Interval')

# Set y-axis limits to ignore outliers/extreme values
plt.ylim([0,70]) 
plt.title("Figure 7. ARIMA(5,1,1) Forecast vs Actual Test Data (Last 50 Points)", fontsize=14)
plt.xlabel("Time Step")
plt.ylabel("Values")
plt.legend()
plt.grid(True)
plt.show()

### 2. Random Forest Model

In [None]:
rf = RandomForestRegressor(max_depth=500, random_state=0)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)

# evaluation
rf_mse = mean_squared_error(y_test, rf_pred)
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
rf_mape = mean_absolute_percentage_error(y_test, rf_pred)
rf_r2 = r2_score(y_test, rf_pred)

print("MSE:", rf_mse) 
print("MAE:", rf_mae) 
print("RMSE:", rf_rmse) 
print("MAPE:", rf_mape) 
print("R-squared (Random Forest):", rf_r2)

In [None]:
# Get the last 50 points of y_test and the corresponding predicted values
y_test_last_50 = y_test[-50:]
forecast_values_last_50 = rf_pred[-50:]

# Create a range of values for the x-axis
x_axis_last_50 = range(len(y_test_last_50))

# Plot forecast results against actual data for the last 50 points
plt.figure(figsize=(12, 6))
plt.plot(x_axis_last_50, y_test_last_50, label='Actual Test Data', color='blue', linewidth=2)
plt.plot(x_axis_last_50, forecast_values_last_50, label='Random Forest Regressor', color='red', linewidth=2)

# Set y-axis limits to ignore outliers/extreme values
plt.ylim([0,70])
plt.title("Figure 8. Random Forest Regressor vs Actual Test Data (Last 50 Points)", fontsize=14)
plt.xlabel("Time Step")
plt.ylabel("Values")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# rf = RandomForestRegressor(random_state=0)

# # Define parameter grid 
# param_grid = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [10, 20, 30, None],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4],
#     'max_features': ['auto', 'sqrt', 'log2']
# }

# # Create GridSearchCV object with cross-validation
# grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
#                            scoring='neg_mean_squared_error', cv=5, n_jobs=-1)

# grid_search.fit(X_valid, y_valid)
# cv_results = grid_search.cv_results_
# best_params = grid_search.best_params_

# # # Plot result from cv_results
# # mean_test_scores = cv_results['mean_test_score']
# # max_depths = [param['max_depth'] for param in cv_results['params']]
# # n_estimators = [param['n_estimators'] for param in cv_results['params']]

# # mean_test_scores_grid = np.array(mean_test_scores).reshape(len(max_depths), len(n_estimators))

# # Create a heatmap to visualize MSE
# plt.figure(figsize=(10, 8))
# plt.imshow(mean_test_scores_grid, cmap='viridis', origin='lower')
# plt.colorbar(label='Mean Squared Error (MSE)')
# plt.xticks(np.arange(len(n_estimators)), n_estimators, rotation=45)
# plt.yticks(np.arange(len(max_depths)), max_depths)
# plt.xlabel('Number of Estimators (n_estimators)')
# plt.ylabel('Maximum Depth (max_depth)')
# plt.title('Grid Search Results for Random Forest Hyperparameters')
# plt.show()

# print("Best Hyperparameters:", best_params)

### 3. SVR Model

In [None]:
svr_model = SVR()
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# evaluation
svr_mse = mean_squared_error(y_test, svr_pred)
svr_mae = mean_absolute_error(y_test, svr_pred)
svr_rmse = np.sqrt(mean_squared_error(y_test, svr_pred))
svr_mape = mean_absolute_percentage_error(y_test, svr_pred)
svr_r2 = r2_score(y_test, svr_pred)

print("MSE:", svr_mse) 
print("MAE:", svr_mae) 
print("RMSE:", svr_rmse) 
print("MAPE:", svr_mape)
print("R-squared:", svr_r2)

### 4. Linear SVR Model

In [6]:
linear_svr = LinearSVR(loss='epsilon_insensitive', random_state=0)
linear_svr.fit(X_train, y_train)
linear_svr_pred = linear_svr.predict(X_test)

linear_svr_mse = mean_squared_error(y_test, linear_svr_pred)
linear_svr_mae = mean_absolute_error(y_test, linear_svr_pred)
linear_svr_rmse = np.sqrt(mean_squared_error(y_test, linear_svr_pred))
linear_svr_mape = mean_absolute_percentage_error(y_test, linear_svr_pred)
linear_svr_r2 = r2_score(y_test, linear_svr_pred)

print("MSE:", linear_svr_mse) 
print("MAE:", linear_svr_mae) 
print("RMSE:", linear_svr_rmse) 
print("MAPE:", linear_svr_mape) 
print("R-squared:", linear_svr_r2)

MSE: 53.25887128277389
MAE: 1.1404649475441762
RMSE: 7.297867584628669
MAPE: 0.1311646933757877
R-squared: 0.997204954161102




#### Hyperparameter Tuning for Linear SVR 

In [None]:
svr = LinearSVR() 
params = {
    'C': [0.01, 0.1, 1],
    'epsilon': [0.01, 0.1, 0.5],
    'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'],
    'max_iter': [10000, 50000, 100000],
    'tol': [0.01, 0.05, 0.1]
}

# Grid Search cross-validation
grid_search = GridSearchCV(svr, params, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_valid, y_valid)
best_svr = grid_search.best_estimator_

# Print the best params
print("Best Parameters:", grid_search.best_params_)



In [None]:
# Make predictions with the best estimator
best_svr_pred = best_svr.predict(X_test)

In [None]:
# Evaluate the best estimator
best_svr_mse = mean_squared_error(y_test, best_svr_pred)
best_svr_mae = mean_absolute_error(y_test, best_svr_pred)
best_svr_rmse = np.sqrt(mean_squared_error(y_test, best_svr_pred))
best_svr_mape = mean_absolute_percentage_error(y_test, best_svr_pred)
best_svr_r2 = r2_score(y_test, best_svr_pred)

# Print the evaluation metrics
print("MSE:", best_svr_mse) 
print("MAE:", best_svr_mae) 
print("RMSE:", best_svr_rmse) 
print("MAPE:", best_svr_mape) 
print("R-squared:", best_svr_r2)

**Observation:**

The tuned LinearSVR model outperforms the default model in stock price prediction on every reported metric. The Mean Squared Error (MSE) falls from 53.259 to 39.157, a reduction of roughly 26%. The Mean Absolute Error (MAE) is almost halved, from 1.140 to 0.618, so the typical prediction lands closer to the actual price. The Root Mean Squared Error (RMSE) drops from 7.298 to 6.258, suggesting that large errors are smaller or less frequent. The Mean Absolute Percentage Error (MAPE) also improves, from 13.12% to 9.15%, meaning the relative prediction error is smaller.

Based on these metrics, the tuned LinearSVR model is the better choice as it shows significant improvements in prediction accuracy and error reduction. This makes it more reliable for stock price prediction, especially in a field where precision is crucial. However, it is important to consider the model's performance on unseen data to ensure its generalizability.
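To make the comparison concrete, the relative improvement for each metric can be computed directly from the figures quoted in the observation. This is a minimal sketch; the dictionaries simply restate those rounded values and the helper name `relative_reduction` is illustrative, not part of the notebook.

```python
# Metric values quoted in the observation (default vs. tuned LinearSVR)
default = {"MSE": 53.259, "MAE": 1.140, "RMSE": 7.298, "MAPE": 0.1312}
tuned = {"MSE": 39.157, "MAE": 0.618, "RMSE": 6.258, "MAPE": 0.0915}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


reductions = {name: relative_reduction(default[name], tuned[name]) for name in default}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower after tuning")
```

Reporting relative reductions alongside the raw values makes it easier to compare metrics that live on different scales (squared price units for MSE, a ratio for MAPE).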

In [None]:
# Initialize the LinearSVR model with the best parameters
best_svr = LinearSVR(C=grid_search.best_params_['C'],
                      epsilon=grid_search.best_params_['epsilon'],
                      loss=grid_search.best_params_['loss'],
                      max_iter=grid_search.best_params_['max_iter'],
                      tol=grid_search.best_params_['tol'])

best_svr.fit(X_train, y_train)

# Make predictions with the final model
final_pred = best_svr.predict(X_test)

# Evaluate the final model
mse = mean_squared_error(y_test, final_pred)
mae = mean_absolute_error(y_test, final_pred)
rmse = np.sqrt(mean_squared_error(y_test, final_pred))
mape = mean_absolute_percentage_error(y_test, final_pred)
r2 = r2_score(y_test, final_pred)

# Print the evaluation metrics
print("Final Model MSE:", mse) 
print("Final Model MAE:", mae) 
print("Final Model RMSE:", rmse) 
print("Final Model MAPE:", mape) 
print("Final Model R-squared:", r2)


### 5. XGBoost Model

In [None]:
# Baseline XGB Regressor
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)
xgb_reg.fit(X_train, y_train)
xgb_pred = xgb_reg.predict(X_test)

xgb_mse = mean_squared_error(y_test, xgb_pred)
xgb_mae = mean_absolute_error(y_test, xgb_pred)
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
xgb_mape = mean_absolute_percentage_error(y_test, xgb_pred)

print("MSE:", xgb_mse) 
print("MAE:", xgb_mae) 
print("RMSE:", xgb_rmse) 
print("MAPE:", xgb_mape) 

#### Regularization 

Regularization is a technique used to prevent overfitting by penalizing models with extreme coefficient values. XGBoost offers two parameters for regularization [6]:

`lambda`: L2 regularization term on weights, also known as reg_lambda. It's used to help prevent overfitting by adding a penalty for larger weights in the model.

`alpha`: L1 regularization term on weights, also known as reg_alpha. It encourages sparsity, meaning it can set some weight coefficients in the model to zero.
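For reference, both penalties enter the complexity term that XGBoost adds to the training loss for each tree. The expression below is a sketch in the notation commonly used in the XGBoost documentation; the symbols $T$ (number of leaves), $w_j$ (weight of leaf $j$), and $\gamma$ (per-leaf penalty) are not used elsewhere in this notebook. Note that the "weights" being penalized are leaf weights rather than linear coefficients.

```latex
\Omega(f) = \gamma T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

Larger values of `reg_lambda` shrink all leaf weights smoothly, while larger values of `reg_alpha` can push individual leaf weights to exactly zero.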

In [None]:
xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    seed=0,
    reg_lambda=1,  
    reg_alpha=1    
)

xgb_reg.fit(X_train, y_train)
xgb_pred = xgb_reg.predict(X_test)

xgb_mse = mean_squared_error(y_test, xgb_pred)
xgb_mae = mean_absolute_error(y_test, xgb_pred)
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
xgb_mape = mean_absolute_percentage_error(y_test, xgb_pred)
print("MSE:", xgb_mse) 
print("MAE:", xgb_mae) 
print("RMSE:", xgb_rmse) 
print("MAPE:", xgb_mape) 

In [None]:
# Plot feature importance
xgb.plot_importance(xgb_reg)
plt.title("Figure 11. Feature Importance in XGBoost Regression after Regularization")
plt.show()

In [None]:
# Initialize model
xgb_reg = xgb.XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'learning_rate': [0.01, 0.1, 0.2, 0.4],
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
    'reg_lambda': [1e-5, 1e-2, 0.1, 1, 100]
}

# Initialize grid search
grid_search = GridSearchCV(estimator=xgb_reg, 
                           param_grid=param_grid, 
                           cv=5, 
                           scoring='neg_mean_squared_error'
                          )

grid_search.fit(X_valid, y_valid)
print("Best parameters found: ", grid_search.best_params_)

In [None]:
# # Train the final model with the best parameters
# best_xgb = xgb.XGBRegressor(**grid_search.best_params_)
# best_xgb.fit(X_train, y_train)
# xgb_pred = best_xgb.predict(X_test)

# # Evaluate
# xgb_mse = mean_squared_error(y_test, xgb_pred)
# xgb_mae = mean_absolute_error(y_test, xgb_pred)
# xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
# xgb_mape = mean_absolute_percentage_error(y_test, xgb_pred)
# print("Tuned XGBoost for Regression MSE:", xgb_mse)
# print("Tuned XGBoost for Regression MAE:", xgb_mae)
# print("Tuned XGBoost for Regression RMSE:", xgb_rmse)
# print("Tuned XGBoost for Regression MAPE:", xgb_mape)

### 6. LSTM Model

In [None]:
# # Training LSTM model
# class LSTMModel:
#     def __init__(self, input_size, lstm_units=50, epochs=50, batch_size=32):
#         self.input_size = input_size
#         self.lstm_units = lstm_units
#         self.epochs = epochs
#         self.batch_size = batch_size
#         self.model = self.build_model()

#     def build_model(self):
#         model = Sequential()
#         model.add(LSTM(units=self.lstm_units, input_shape=self.input_size))
#         model.add(Dense(units=1))
#         model.compile(optimizer='adam', loss='mean_squared_error')
#         return model

#     def train(self, X_train, y_train):
#         X_train = np.reshape(X_train.values, (X_train.shape[0], 1, X_train.shape[1]))
#         self.model.fit(X_train, y_train, epochs=self.epochs, batch_size=self.batch_size)

#     def predict(self, X_test):
#         X_test = np.reshape(X_test.values, (X_test.shape[0], 1, X_test.shape[1]))
#         return self.model.predict(X_test)

#     def evaluate(self, y_test, predictions):
#         mse = mean_squared_error(y_test, predictions)
#         return mse
    
#     def shutdown_cluster(self):
#         if self.cluster is not None:
#             self.cluster.shutdown() 

# predictor = LSTMModel(input_size=(1, X_train.shape[1]))
# predictor.train(X_train, y_train)
# predictions = predictor.predict(X_test)
# lstm_mse = predictor.evaluate(y_test, predictions)
# print(f"MSE of LSTM: {lstm_mse}")

In [None]:
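# Keras layers used below (these imports are commented out in the first cell)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Extract numpy arrays from DataFrames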
X_train_values = X_train.values
X_test_values = X_test.values
X_valid_values = X_valid.values

# Reshape input data to 3D [samples, timesteps, features] so it'll fit the LSTM layer
# Use a timestep of 1.
X_train_reshaped = np.reshape(X_train_values, (X_train_values.shape[0], 1, X_train_values.shape[1]))
X_test_reshaped = np.reshape(X_test_values, (X_test_values.shape[0], 1, X_test_values.shape[1]))
X_valid_reshaped = np.reshape(X_valid_values, (X_valid_values.shape[0], 1, X_valid_values.shape[1]))

# LSTM model
lstm = Sequential()
lstm.add(LSTM(50, return_sequences=True, input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
lstm.add(Dropout(0.2))
lstm.add(LSTM(50, return_sequences=False))
lstm.add(Dropout(0.2))
lstm.add(Dense(1))  # Prediction of the next closing value

# Compile the model
lstm.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
lstm.fit(X_train_reshaped, y_train, epochs=50, batch_size=32, validation_data=(X_valid_reshaped, y_valid), verbose=1)

# Predicting and inverse transforming the predictions 
y_pred = lstm.predict(X_test_reshaped)
y_pred = scaler.inverse_transform(y_pred)

# Calculate MSE for evaluation
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

## Financial News Sentiment Analysis

In [None]:
!pip install git+https://github.com/huggingface/transformers
!pip install transformers
!pip install tensorflow_probability==0.12.2
!pip install contractions
!pip install emoji
!pip install emoticon_fix
!pip install -U accelerate

In [None]:
# Use a pipeline as a high-level helper
# from transformers import pipeline
# pipe_sentiment = pipeline("text-classification", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
import contractions
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, PunktSentenceTokenizer
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer
import requests
from bs4 import BeautifulSoup
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
# nltk.download('omw-1.4')
import re
!pip install spacy
import spacy
import string
import emoji
from emoticon_fix import emoticon_fix
from nltk.corpus import stopwords

In [None]:
from google.colab import drive
drive.mount('/content/Mydrive')

In [None]:
df1=pd.read_csv('/content/Mydrive/MyDrive/raw_partner_headlines.csv')

In [None]:
df1_copy = df1.copy()

### 1. Pre-trained Hugging Face Model

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe_sentiment = pipeline("text-classification", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")

In [None]:
df1 = df1[(df1['date'] > '2017-01-01') & (df1['date'] < '2017-12-31')]
df1 = df1.dropna(axis=0)
df1.drop_duplicates(subset='headline', keep='first', inplace=True)

In [None]:
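# Drop missing rows and duplicate headlines, then rebuild the index
# (reset_index() keeps the old index as an 'index' column, which is dropped later)
df1 = df1.dropna(axis=0)
df1 = df1.drop_duplicates(subset='headline', keep='first')
df1 = df1.reset_index()
df1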

In [None]:
def extract_paragraphs_from_url(url):
    import requests
    from bs4 import BeautifulSoup
    # Send an HTTP GET request to the URL
    response = requests.get(url)

    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all the <p> elements and extract their text
        paragraphs = [p.get_text() for p in soup.find_all('p')]

        # Join the paragraphs into a single text
        extracted_text = '\n'.join(paragraphs)

        return extracted_text
    else:
        flag=1
        return flag

# Example usage
url = "https://seekingalpha.com/article/4133429-agilent-technologies-good-business-total-return-pricey?source=partner_benzinga"  # Replace with the URL you want to extract data from
extracted_data = extract_paragraphs_from_url(url)

# The helper returns the flag 1 when the request fails, so test for that explicitly
if extracted_data != 1:
    # Print or store the extracted data as a paragraph
    print(extracted_data)

In [None]:
def clean_data(text, lang):
    # remove HTML
    soup = BeautifulSoup(text, 'lxml')
    text = soup.get_text()
    # replace emoticons/emoji with text
    text = emoji.demojize(text, language=lang)
    text = emoticon_fix.emoticon_fix(text)
    # decoding of abbreviations (disabled)
    # text = abbr_conversion(text)
    # remove mentions
    text = re.sub(r"@[A-Za-z0-9]+", "", text)
    # remove hashtags
    text = re.sub(r"#[A-Za-z0-9_]+", "", text)
    # remove links
    text = re.sub(r'https://\S+', '', text)
    # remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # remove newlines and any remaining non-word characters
    text = re.sub(r'[^ \w\.]', '', text)
    # remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    return text
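The regex steps in `clean_data` can be checked in isolation on a short headline. The sketch below is a standard-library-only subset of the cleaning (no HTML or emoji handling); the function name `strip_social_noise`, the sample text, and the widened `https?` link pattern are assumptions made for illustration.

```python
import re
import string


def strip_social_noise(text):
    """Regex-only subset of the cleaning: mentions, hashtags, links,
    punctuation, and tokens that contain digits."""
    text = re.sub(r"@[A-Za-z0-9]+", "", text)       # mentions
    text = re.sub(r"#[A-Za-z0-9_]+", "", text)      # hashtags
    text = re.sub(r"https?://\S+", "", text)        # links (http or https)
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    text = re.sub(r"\w*\d\w*", "", text)            # words containing digits
    return " ".join(text.split())                   # collapse leftover spaces


sample = "@trader AAPL beats Q4 estimates! #earnings https://example.com/a1"
print(strip_social_noise(sample))
```

Removing links before punctuation matters: once the punctuation pass has replaced `:` and `/` with spaces, the URL pattern no longer matches.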

In [None]:
cleaned_data = clean_data(extracted_data, 'en')
cleaned_data

In [None]:
df1["Labels"] = np.nan
df1["Scores"] = np.nan

In [None]:
df1.columns
df1 = df1.drop(['index','Unnamed: 0'],axis=1)

In [None]:
# # start_time = time.time()
# for i in range(0,len(df1)):
#     try:
#         extracted_data = extract_paragraphs_from_url(df1['url'][i])
#         if extracted_data == 1:
#             extracted_data = clean_data(df1['headline'][i],'eng')
#             sentiment1 = pipe_sentiment(extracted_data)
#             df1['Labels'][i] = sentiment1[0]['label']
#             df1['Scores'][i] = sentiment1[0]['score']
#         else:
#             cleaned_data = clean_data(extracted_data,'eng')
#             sentiment2 = pipe_sentiment(cleaned_data,truncation=True)
#             df1['Labels'][i] = sentiment2[0]['label']
#             df1['Scores'][i] = sentiment2[0]['score']
#         if i%100==0:
#             df1.to_csv('Set.csv')

#     except:
#         pass
# # print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# df1.to_csv('Final_ouput.csv')

In [None]:
# new_df = pd.read_csv('/content/Mydrive/MyDrive/Final_ouput.csv')

In [None]:
# new_df = new_df.dropna(axis=0)

In [None]:
# new_df.to_csv("SentimentAnalysis.csv")

### 2. Naive Bayes Model

In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, PunktSentenceTokenizer
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
from nltk.corpus import stopwords
import collections
import spacy
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt
from scipy.sparse import coo_matrix
from sklearn.decomposition import PCA

In [None]:
data = pd.read_csv('/content/Mydrive/MyDrive/SentimentAnalysis.csv')

In [None]:
data_copy = data.copy()

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
nlp.pipe_names

In [None]:
nlp = spacy.blank('en')

# There should be no pipeline components.
nlp.pipe_names

In [None]:
def spacy_tokenizer(doc):
    return [t.text for t in nlp(doc) if \
            not t.is_punct and \
            not t.is_space and \
            t.is_alpha]

In [None]:
encoder = LabelEncoder()
labels = data['Labels'].values
encoded_labels = encoder.fit_transform(labels)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['headline'], encoded_labels, stratify = encoded_labels,train_size=0.2)

In [None]:
%%time
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
train_feature_vects = vectorizer.fit_transform(X_train)

#### Train Set

In [None]:
nb_classifier = MultinomialNB()
nb_classifier.fit(train_feature_vects, y_train)
nb_classifier.get_params()

In [None]:
train_preds = nb_classifier.predict(train_feature_vects)

In [None]:
train_preds_score = nb_classifier.predict_proba(train_feature_vects)

In [None]:
print('Accuracy score on  train set: {}'.format(metrics.accuracy_score(y_train, train_preds)))

In [None]:
print('F1 score on train set: {}'.format(metrics.f1_score(y_train, train_preds, average='weighted')))

In [None]:
print('Precision score on train set: {}'.format(metrics.precision_score(y_train, train_preds, average='weighted')))

In [None]:
print('Recall score on train set: {}'.format(metrics.recall_score(y_train, train_preds, average='weighted')))

In [None]:
print('ROC-AUC score on train set: {}'.format(metrics.roc_auc_score(y_train,train_preds_score, average='weighted',multi_class='ovr')))

#### Test Set

In [None]:
test_feature_vects = vectorizer.transform(X_test)

In [None]:
test_preds = nb_classifier.predict(test_feature_vects)

In [None]:
test_preds_score = nb_classifier.predict_proba(test_feature_vects)

In [None]:
print('Accuracy score on test set: {}'.format(metrics.accuracy_score(y_test, test_preds)))

In [None]:
print('F1 score on test set: {}'.format(metrics.f1_score(y_test, test_preds, average='weighted')))

In [None]:
print('Precision score on test set: {}'.format(metrics.precision_score(y_test, test_preds, average='weighted')))

In [None]:
print('Recall score on test set: {}'.format(metrics.recall_score(y_test, test_preds, average='weighted')))

In [None]:
print('ROC-AUC score on test set: {}'.format(metrics.roc_auc_score(y_test, test_preds_score, average='weighted',multi_class='ovr')))

In [None]:
cr_NB = metrics.classification_report(y_test, test_preds)
print("\n\nClassification Report\n")
print(cr_NB)

In [None]:
cm_NB = metrics.confusion_matrix(y_test, test_preds)
print("Confusion Matrix\n")
print(cm_NB)

In [None]:
sns.heatmap(cm_NB, annot=True,cmap='Blues')

##### The cells below are commented out; they were run once to collect a CSV with the aggregated labels and scores

In [None]:
# df1_copy['Labels'] = np.nan
# df1_copy['Scores'] = np.nan

In [None]:
# for i in range(10,len(X_train)):
#   j=df1_copy.index[df1_copy['headline']==X_train.iloc[i]].tolist()[0]
#   df1_copy['Labels'][j] = train_preds[i]
#   df1_copy['Scores'][j] = np.max(train_preds_score[i])

In [None]:
# for i in range(0,len(X_test)):
#   j=df1_copy.index[df1_copy['headline']==X_test.iloc[i]].tolist()[0]
#   df1_copy['Labels'][j] = test_preds[i]
#   df1_copy['Scores'][j] = np.max(test_preds_score[i])

In [None]:
# df1_copy.to_csv('Final-NB.csv')

### 3. KNN Model

In [None]:
from sklearn.metrics.pairwise import cosine_distances
model_knn = KNeighborsClassifier(n_neighbors=22,metric='cosine',weights='distance')

#### Train Set

In [None]:
model_knn.fit(train_feature_vects,y_train)

In [None]:
pred_train_knn = model_knn.predict(train_feature_vects)

In [None]:
pred_train_scores_knn = model_knn.predict_proba(train_feature_vects)

In [None]:
print('Accuracy score on train set: {}'.format(metrics.accuracy_score(y_train,pred_train_knn)))

In [None]:
print('F1 score on train set: {}'.format(metrics.f1_score(y_train,pred_train_knn, average='weighted')))

In [None]:
print('Precision on train set: {}'.format(metrics.precision_score(y_train,pred_train_knn, average='weighted')))

In [None]:
print('Recall on train set: {}'.format(metrics.recall_score(y_train,pred_train_knn, average='weighted')))

In [None]:
print('ROC-AUC score on train set: {}'.format(metrics.roc_auc_score(y_train,pred_train_scores_knn, average='weighted',multi_class='ovr')))

#### Test Set

In [None]:
pred_test_knn = model_knn.predict(test_feature_vects)

In [None]:
pred_test_scores_knn = model_knn.predict_proba(test_feature_vects)

In [None]:
print('Accuracy score on test set: {}'.format(metrics.accuracy_score(y_test,pred_test_knn)))

In [None]:
print('F1 score on test set: {}'.format(metrics.f1_score(y_test,pred_test_knn, average='weighted')))

In [None]:
print('Precision on test set: {}'.format(metrics.precision_score(y_test,pred_test_knn, average='weighted')))

In [None]:
print('Recall on test set: {}'.format(metrics.recall_score(y_test,pred_test_knn, average='weighted')))

In [None]:
print('ROC-AUC score on test set: {}'.format(metrics.roc_auc_score(y_test,pred_test_scores_knn, average='weighted',multi_class='ovr')))

In [None]:
cr_KNN = metrics.classification_report(y_test,pred_test_knn)
print("\n\nClassification Report\n")
print(cr_KNN)

In [None]:
cm_KNN = metrics.confusion_matrix(y_test,pred_test_knn)
print("Confusion Matrix\n")
print(cm_KNN)

In [None]:
sns.heatmap(cm_KNN, annot=True,cmap='Blues')

In [None]:
# df1_copy['Labels'] = np.nan
# df1_copy['Scores'] = np.nan

In [None]:
# for i in range(10,len(X_train)):
#   j=df1_copy.index[df1_copy['headline']==X_train.iloc[i]].tolist()[0]
#   df1_copy['Labels'][j] = pred_train_knn[i]
#   df1_copy['Scores'][j] = np.max(pred_train_scores_knn[i])

In [None]:
# for i in range(0,len(X_test)):
#   j=df1_copy.index[df1_copy['headline']==X_test.iloc[i]].tolist()[0]
#   df1_copy['Labels'][j] = pred_test_knn[i]
#   df1_copy['Scores'][j] = np.max(pred_test_scores_knn[i])

In [None]:
# df1_copy.to_csv('Final-KNN.csv')

### Graphs

#### Naive Bayes

In [None]:
label_binarizer1 = LabelBinarizer().fit(y_train)
y_onehot_test = label_binarizer1.transform(y_test)

In [None]:
label_binarizer1.transform([1])

In [None]:
class_of_interest0 = 0
class_of_interest1 = 1
class_of_interest2= 2
class_id0 = np.flatnonzero(label_binarizer1.classes_ == 0)[0]
class_id1 = np.flatnonzero(label_binarizer1.classes_ == 1)[0]
class_id2 = np.flatnonzero(label_binarizer1.classes_ == 2)[0]

In [None]:
RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id0],
    test_preds_score[:, class_id0],
    name=f"{class_of_interest0} vs the rest",
    color="darkorange"
)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nNegative vs (Neutral& Positive)")
plt.legend()
plt.show()

RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id1],
    test_preds_score[:, class_id1],
    name=f"{class_of_interest1} vs the rest",
    color="green"
)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nNeutral vs (Negative& Positive)")
plt.legend()
plt.show()

RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id2],
    test_preds_score[:, class_id2],
    name=f"{class_of_interest2} vs the rest",
    color="blue"

)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nPositive vs (Negative& Neutral)")
plt.legend()
plt.show()


In [None]:
RocCurveDisplay.from_predictions(
    y_onehot_test.ravel(),
    test_preds_score.ravel(),
    name="micro-average OvR",
    color="darkorange"
)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Micro-averaged One-vs-Rest\nReceiver Operating Characteristic")
plt.legend()
plt.show()

#### KNN

In [None]:
RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id0],
    pred_test_scores_knn[:, class_id0],
    name=f"{class_of_interest0} vs the rest",
    color="darkorange"
)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nNegative vs (Neutral& Positive)")
plt.legend()
plt.show()

RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id1],
    pred_test_scores_knn[:, class_id1],
    name=f"{class_of_interest1} vs the rest",
    color="green"
)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nNeutral vs (Negative& Positive)")
plt.legend()
plt.show()

RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id2],
    pred_test_scores_knn[:, class_id2],
    name=f"{class_of_interest2} vs the rest",
    color="blue"

)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nPositive vs (Negative& Neutral)")
plt.legend()
plt.show()


In [None]:
RocCurveDisplay.from_predictions(
    y_onehot_test.ravel(),
    pred_test_scores_knn.ravel(),
    name="micro-average OvR",
    color="darkorange"
)
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Micro-averaged One-vs-Rest\nReceiver Operating Characteristic")
plt.legend()
plt.show()

#### Comparison of predictions between HuggingFace,Naive Bayes and KNN

In [None]:
pca = PCA(n_components=2)
dense_matrix_test = test_feature_vects.toarray()
pca_x_test = pca.fit_transform(dense_matrix_test)

In [None]:
# Negative class (encoded label 0): one panel per label source
panels = [("Negative HF", y_test), ("Negative NB", test_preds), ("Negative KNN", pred_test_knn)]
for k, (title, labels_k) in enumerate(panels, start=1):
    # Select the test points that this label source assigns to the class
    mask = np.asarray(labels_k) == 0
    plt.subplot(1, 3, k)
    plt.scatter(pca_x_test[mask, 0], pca_x_test[mask, 1], label='Negative',
                color=np.random.rand(3), edgecolor='k', s=30)
    plt.legend()
    plt.xlabel("Features reduced", fontsize=10)
    plt.ylabel("Features reduced", fontsize=10)
    plt.title(title)
plt.show()

In [None]:
# Neutral class (encoded label 1): one panel per label source
panels = [("Neutral HF", y_test), ("Neutral NB", test_preds), ("Neutral KNN", pred_test_knn)]
for k, (title, labels_k) in enumerate(panels, start=1):
    # Select the test points that this label source assigns to the class
    mask = np.asarray(labels_k) == 1
    plt.subplot(1, 3, k)
    plt.scatter(pca_x_test[mask, 0], pca_x_test[mask, 1], label='Neutral',
                color=np.random.rand(3), edgecolor='k', s=30)
    plt.legend()
    plt.xlabel("Features reduced", fontsize=10)
    plt.ylabel("Features reduced", fontsize=10)
    plt.title(title)
plt.show()

In [None]:
# Positive class (encoded label 2): one panel per label source
panels = [("Positive HF", y_test), ("Positive NB", test_preds), ("Positive KNN", pred_test_knn)]
for k, (title, labels_k) in enumerate(panels, start=1):
    # Select the test points that this label source assigns to the class
    mask = np.asarray(labels_k) == 2
    plt.subplot(1, 3, k)
    plt.scatter(pca_x_test[mask, 0], pca_x_test[mask, 1], label='Positive',
                color=np.random.rand(3), edgecolor='k', s=30)
    plt.legend()
    plt.xlabel("Features reduced", fontsize=10)
    plt.ylabel("Features reduced", fontsize=10)
    plt.title(title)
plt.show()

## Content-based Collaborative Filtering (PHASE-3)

In [None]:
sentiment_df = pd.read_csv('Final-KNN.csv')
print(sentiment_df.columns)

# Drop unwanted columns
unwanted_columns = ['Unnamed: 0.1', 'index', 'Unnamed: 0', 'headline', 'url', 'publisher']
sentiment_df = sentiment_df.drop(columns=unwanted_columns)
sentiment_df.head()

# Extract only yyyy-mm-dd
sentiment_df['date'] = pd.to_datetime(sentiment_df['date']).dt.date

# Sort df by 'ticker', 'date', and 'Scores' in descending order
sentiment_df = sentiment_df.rename(columns={'stock': 'ticker'})
sentiment_df = sentiment_df.sort_values(by=['ticker', 'date', 'Scores'], ascending=[True, True, False])

# Get the highest score for each unique 'ticker' and 'date' combination
sentiment_df = sentiment_df.drop_duplicates(subset=['ticker', 'date'], keep='first')
sentiment_df = sentiment_df.reset_index(drop=True)
sentiment_df.columns = sentiment_df.columns.str.lower()
print(sentiment_df.head())

In [None]:
stock_df = pd.read_csv('pca_stock_result.csv')
stock_df['Date'] = pd.to_datetime(stock_df['Date'])
# Keep 'Date' as a regular column (it is used as a merge key later) and rename columns to lowercase
stock_df = stock_df.rename(columns={
    'Feature_1': 'feature_1',
    'Feature_2': 'feature_2',
    'Feature_3': 'feature_3',
    'Feature_4': 'feature_4',
    'Feature_5': 'feature_5',
    'Date': 'date',
    'Ticker': 'ticker',
    'Close': 'close'
})
stock_df.head()

### Feature Representation

In [None]:
# Verify the tickers match between data sets
stock_tickers = stock_df['ticker'].unique()
sentiment_tickers = sentiment_df['ticker'].unique()
set(stock_tickers) == set(sentiment_tickers)

In [None]:
stock_tickers = set(stock_df['ticker'].unique())
sentiment_tickers = set(sentiment_df['ticker'].unique())

missing_in_stock = sentiment_tickers.difference(stock_tickers)
missing_in_sentiment = stock_tickers.difference(sentiment_tickers)

print("Tickers missing in stock_df but present in sentiment_df:", missing_in_stock)
print("Tickers missing in sentiment_df but present in stock_df:", missing_in_sentiment)

In [None]:
# Find tickers in stock_df that are not in sentiment_df and delete those tickers from stock_df
missing_tickers = set(stock_tickers) - set(sentiment_tickers)
stock_df = stock_df[~stock_df['ticker'].isin(missing_tickers)]
stock_df = stock_df.reset_index(drop=True)

In [None]:
# Perform a left merge on 'ticker' and 'date' columns; rows without sentiment get NaN
sentiment_df['date'] = pd.to_datetime(sentiment_df['date'])
stock_df['date'] = pd.to_datetime(stock_df['date'])
product_df = pd.merge(stock_df, sentiment_df, on=['ticker', 'date'], how='left')
print(product_df.head())

In [None]:
# Forward-fill missing sentiment ('labels' and 'scores') within each ticker
product_df.sort_values(by=['ticker', 'date'], inplace=True)
product_df['labels'] = product_df.groupby('ticker')['labels'].ffill()
product_df['scores'] = product_df.groupby('ticker')['scores'].ffill()
product_df.head()

In [None]:
product_df['labels'].isnull().sum() 

We forward-fill missing `labels` and `scores` within each ticker. Rows at the very beginning of a ticker's history, before any sentiment data is available, have nothing to fill from and therefore remain NaN. We remove these 21,553 rows.
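The behaviour of per-ticker forward filling can be seen on a small toy frame. This is an illustrative sketch only: the tickers, dates, and scores below are made up and do not come from the project data.

```python
import numpy as np
import pandas as pd

# Toy data: ticker "A" has no sentiment on its first day, ticker "B" does
toy = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-04"] * 2),
    "scores": [np.nan, 0.9, np.nan, 0.7, np.nan, np.nan],
})

# Sort so that "forward" means forward in time within each ticker
toy = toy.sort_values(["ticker", "date"])
toy["scores_filled"] = toy.groupby("ticker")["scores"].ffill()

# Leading gaps have no earlier value to copy, so they stay NaN
remaining = int(toy["scores_filled"].isna().sum())
print(toy)
print("Rows still missing after forward fill:", remaining)
```

Grouping by ticker before filling prevents the last sentiment value of one stock from leaking into the first rows of the next one.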

In [None]:
product_df = product_df.dropna()
product_df = product_df.reset_index()
product_df.head()

In [None]:
product_df['date'] = pd.to_datetime(product_df['date'])
# Label encoding for 'ticker'
# label_encoder = LabelEncoder()
# merged_df['ticker_encoded'] = label_encoder.fit_transform(merged_df['ticker'])

In [None]:
product_df.drop(['index'], axis=1, inplace=True)

In [None]:
product_df

In [None]:
test_predict_df = product_df.copy()

In [None]:
test_predict_df.to_csv("Data-Price-Pred-Testing.csv")

In [None]:
product_df.drop(['ticker'], axis=1, inplace=True)
product_df.drop(['date'], axis=1, inplace=True)

In [None]:
# Convert to numpy array
ticker_features = np.array(product_df)

In [None]:
product_df

### Similarity Matrix
Calculate the similarity between stocks based on the PCA-transformed features and the embedded sentiment features.

In [None]:
# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(ticker_features[:, 1:], ticker_features[:, 1:])

In [None]:
# Calculate linear kernel 
kernel_matrix = linear_kernel(ticker_features[:, 1:], ticker_features[:, 1:])
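
The two measures above differ only in normalization. The sketch below uses a small hypothetical feature array (not our data) to show that cosine similarity is the linear kernel divided by the product of the row norms, so it ignores vector length while the linear kernel does not.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical feature rows: b points the same way as a but is twice as long
X = np.array([
    [1.0, 2.0, 0.0],   # a
    [2.0, 4.0, 0.0],   # b = 2 * a
    [0.0, 0.0, 3.0],   # c, orthogonal to a and b
])

cos = cosine_similarity(X)   # normalized dot products
lin = linear_kernel(X)       # raw dot products, sensitive to scale

# Cosine similarity equals the linear kernel divided by the row norms
norms = np.linalg.norm(X, axis=1)
cos_from_lin = lin / np.outer(norms, norms)

print("cosine:\n", cos)
print("linear kernel:\n", lin)
```

Both functions return a dense n-by-n array, so memory grows quadratically with the number of rows passed in.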

### Candidate Generation
Rank stocks based on their similarity to the user's profile vector, as discussed earlier. Stocks with higher similarity scores are recommended.
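
As a rough sketch of this ranking step, the example below assumes the user's profile vector is the mean feature vector of the stocks they already hold. That definition, the ticker names, and the feature values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-ticker feature vectors (one row per ticker)
tickers = np.array(["AAA", "BBB", "CCC", "DDD"])
features = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
])

# Assumption: the user's profile is the mean vector of the stocks they hold
held = ["AAA"]
held_mask = np.isin(tickers, held)
profile = features[held_mask].mean(axis=0, keepdims=True)

# Similarity of every stock to the profile vector
scores = cosine_similarity(profile, features).ravel()

# Rank only the stocks the user does not already hold, best first
candidate_idx = np.where(~held_mask)[0]
ranked = candidate_idx[np.argsort(scores[candidate_idx])[::-1]]
ranked_tickers = tickers[ranked].tolist()
print(ranked_tickers)
```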

In [None]:
cosine_sim_matrix = pd.read_csv("Cosine-similarity.csv")
cosine_sim_matrix.drop(columns=['Unnamed: 0'], inplace=True)
kernel_matrix = pd.read_csv("kernel-matrix.csv")
kernel_matrix.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
cosine_sim_matrix.head()

In [None]:
cosine_sim_matrix = cosine_sim_matrix.to_numpy()
cosine_sim_matrix

In [None]:
kernel_matrix = kernel_matrix.to_numpy()
kernel_matrix

In [None]:
type(cosine_sim_matrix)

In [None]:
# Output columns: keep the ticker and date identifiers that were dropped from product_df
candidate_info = test_predict_df.reset_index(drop=True)
titles = candidate_info[['ticker', 'date']]
indices = pd.Series(candidate_info.index, index=candidate_info['ticker'])

**Complete work in a function with pre-computed similarity matrix**

In [None]:
def generate_recommendation(ticker, top_n):
    """
    Given a stock ticker, recommend the top n most similar stocks.

    Parameters:
    ticker: Stock ticker symbol
    top_n: Number of recommendations to return
    ------------
    Returns: Array of the top n recommended stock ticker symbols
    """
    # Validate inputs
    if top_n < 1:
        print("Invalid top_n, must be >= 1")
        return

    # 'ticker' was dropped from product_df, so use the copy that still has it
    tickers = test_predict_df['ticker'].unique()

    # Check if ticker exists
    if ticker not in tickers:
        print(f"Ticker {ticker} not found in data")
        return

    # Get index of ticker in similarity matrix
    # (assumes row i of the matrix corresponds to tickers[i])
    idx = list(tickers).index(ticker)

    # Lookup similarity scores for this ticker
    sim_scores = cosine_sim_matrix[idx]

    # Sort by similarity (descending), skip the ticker itself, and take top n
    sort_idx = [i for i in np.argsort(sim_scores)[::-1] if i != idx][:top_n]
    top_tickers = tickers[sort_idx]

    return top_tickers

In [None]:
# Testing
user_input = input("Enter ticker to get recommendations: ").strip()
# input() returns a string, so convert it before passing it to the function
top_n = int(input("Enter the number of recommendations (must be an integer >= 1): "))
generate_recommendation(user_input, top_n)
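
The lookup in `generate_recommendation` indexes the similarity matrix by ticker position, which only works when the matrix has one row per unique ticker. One possible way to obtain such a matrix, sketched here on hypothetical data, is to average each ticker's dated rows into a single vector before computing similarity. The column names `pc1` and `pc2`, the averaging choice, and the helper `recommend` are assumptions for illustration, not the method used above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical row-level data: several dated rows per ticker
rows = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "pc1": [1.0, 1.2, 0.9, 1.1, -1.0, -0.8],
    "pc2": [0.1, 0.3, 0.2, 0.2, 1.0, 1.2],
})

# Assumption: average each ticker's rows into a single feature vector
per_ticker = rows.groupby("ticker").mean(numeric_only=True)
ticker_index = per_ticker.index.to_numpy()

# One row and one column per unique ticker
sim = cosine_similarity(per_ticker.to_numpy())

def recommend(ticker, top_n):
    """Hypothetical helper: top_n tickers most similar to `ticker`, excluding itself."""
    idx = int(np.where(ticker_index == ticker)[0][0])
    order = [i for i in np.argsort(sim[idx])[::-1] if i != idx]
    return ticker_index[order[:top_n]].tolist()

print(recommend("AAA", 1))
```

Averaging discards how a stock's features change over time, so other summaries (for example, the most recent row per ticker) may suit the recommender better.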

## Reference:

    [1] StandardScaler, MinMaxScaler and RobustScaler techniques - ML. (2020, July 15). GeeksforGeeks. https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/
    
    
    [2] https://towardsdatascience.com/why-does-stationarity-matter-in-time-series-analysis-e2fb7be74454

    [3] Zheng, X., & XIONG, N. (2022). Stock price prediction based on PCA-LSTM model. https://doi.org/10.1145/3545839.3545852

    [5] https://developers.google.com/machine-learning/recommendation/content-based/basics
    
    [6] Gulli, A. (2016, March 26). Complete guide to parameter tuning in XGBoost (with codes in Python). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/#:~:text=XGBoost%20provides%20L1%20and%20L2,the%20objective%20function%20during%20training.

    [7] https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis?library=true

    [8] https://github.com/MD-Ryhan/NLP-Preprocesing/blob/main/NLP_Preprocessing.ipynb

    [9] https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis

    [10] https://huggingface.co/pszemraj/led-large-book-summary

    [11] https://wandb.ai/ivangoncharov/FinBERT_Sentiment_Analysis_Project/reports/Financial-Sentiment-Analysis-on-Stock-Market-Headlines-With-FinBERT-HuggingFace--VmlldzoxMDQ4NjM0#:~:text=Financial%20news%20headlines%20are%20a,positive%2C%20negative%2C%20and%20neutral

    [12] https://huggingface.co/docs/transformers/main_classes/pipelines

    [13] https://huggingface.co/models?pipeline_tag=summarization&sort=trending

    [14] https://www.analyticsvidhya.com/blog/2022/03/building-naive-bayes-classifier-from-scratch-to-perform-sentiment-analysis/

    [15] https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_classification_naive_bayes.ipynb#scrollTo=uJ2lYTdW3HXP

    [16] https://www.analyticsvidhya.com/blog/2022/03/building-naive-bayes-classifier-from-scratch-to-perform-sentiment-analysis/

    [17] https://www.kaggle.com/code/carlosaguayo/text-clustering-with-unsupervised-learning

    [18] https://www.kaggle.com/code/barishasdemir/classification-with-naive-bayes

