## Machine Learning Group Project 2

<font size="3">
<div style="text-align: justify">
To cluster the companies in the Russell 2000 index into 9 groups with the K-means algorithm, we first retrieved stock information and classification details from the Bloomberg Terminal and Morningstar. Here's what we did.
<br>
    
We imported all necessary libraries and then read the stock names from the file 'data1.csv'. For each stock on our list, we built a search URL for the Morningstar platform. Then, we fetched the search results page and parsed its HTML to find the link leading to the stock's detail page.
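
Stock names can contain spaces, ampersands, and other characters that are not safe to paste directly into a query string. A minimal sketch of building the search URL with percent-encoding follows. The helper name `build_search_url` and the sample company name are hypothetical; only the base URL and the `query` parameter come from our scraping code.

```python
from urllib.parse import urlencode

# Base search URL used in our scraping loop
SEARCH_BASE = "https://www.morningstar.com/search"

def build_search_url(stock_name: str) -> str:
    """Return a search URL with the stock name percent-encoded (hypothetical helper)."""
    query = urlencode({"query": stock_name})
    return f"{SEARCH_BASE}?{query}"

# Made-up company name containing characters that need encoding
print(build_search_url("A&B Holdings Inc"))
# prints https://www.morningstar.com/search?query=A%26B+Holdings+Inc
```

Without the encoding step, the `&` in a name would be read as the start of a second query parameter and the search term would be cut short.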

Once we accessed this detailed page, we were able to capture the stock classification details. These details were then added to our type_list. If we encountered an empty or missing classification, we made a note of it with the placeholder 'this is error'.
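
One way to keep type_list aligned with the list of stocks is to record exactly one entry per stock and fall back to the placeholder when nothing usable is found. The sketch below illustrates that rule on plain strings. `pick_classification` is a hypothetical helper and is not part of our notebook.

```python
PLACEHOLDER = "this is error"

def pick_classification(texts, placeholder=PLACEHOLDER):
    """Return the first non-empty, stripped text, or the placeholder if there is none."""
    for text in texts:
        cleaned = text.strip()
        if cleaned:
            return cleaned
    return placeholder

# Made-up inputs standing in for the text of the scraped <span> elements
print(pick_classification(["  Small Blend \n"]))  # Small Blend
print(pick_classification(["", "   "]))           # this is error
print(pick_classification([]))                    # this is error
```

Returning a single value per stock means the result list always has the same length as the list of names, which makes it straightforward to attach as a new column later.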

Now, we've finished the task of sourcing stock classification details for a select 100 stocks from Morningstar. All our findings are neatly cataloged in the type_list, ready for any subsequent analysis we might undertake.

### Import Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import plotly.express as px
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

### Data Gathering

In [8]:
# Initialization
df=pd.read_csv('data1.csv')
type_list=[]

# Loop through stocks
for stock in df['Name'].tolist():
    try:
        # Construct a search URL on Morningstar using the stock name
        search_page=f'https://www.morningstar.com/search?query={stock}'
        
        # Fetch the search page
        content=requests.get(search_page).text
        
        # Extract stock-specific page link
        soup=BeautifulSoup(content,'html.parser')
        result=soup.findAll('a',attrs={'class':'mdc-link mdc-security-module__name mds-link mds-link--no-underline mdc-link--no-underline'})[0]['href']
        stock_page='https://www.morningstar.com'+result
        
        # Fetch the stock-specific page
        content=requests.get(stock_page).text
        soup=BeautifulSoup(content,'html.parser')
        classification=soup.findAll('span',attrs={'class':"mdc-data-point mdc-data-point--style-box"})
        
        # Extract stock classification
        for i in classification:
            print(i.text.strip())
            if len(i.text.strip())>0:
                type_list.append(i.text.strip())
            else:
                type_list.append('this is error')
    # Error handling: record a placeholder if any request or parsing step fails
    except Exception:
        type_list.append('this is error')



Small Blend
Small Value
Small Value
Small Blend
Small Blend
Small Growth
Small Blend
Small Growth
Mid Growth
Small Growth
Small Value
Small Value
Small Growth
Small Growth
Small Value
Small Growth
Small Blend
Small Growth
Small Growth
Small Blend
Small Growth
Small Value
Small Value
Small Growth
Small Blend
Small Growth
Small Blend
Small Growth
Small Growth
Small Value
Small Blend
Small Growth
Small Growth
Small Growth
Small Value
Small Growth
Small Blend
Mid Blend
Small Growth
Small Value
Small Blend
Small Blend
Small Blend
Small Blend
Small Growth
Small Value
Small Value
Small Value
Small Blend
Small Growth
Small Growth
Small Growth
Small Blend
Small Growth
Small Growth
Small Blend
Small Growth
Small Growth
Small Growth
Small Growth
Small Value
Small Blend
Small Blend
Small Growth
Small Blend
Small Blend
Small Blend
Small Blend
Small Growth
Small Value
Small Growth
Small Value
Small Blend
Small Growth
Small Growth
Small Blend
Small Blend
Small Blend
Small Blend
Small Growth
Small Gro

### Data Processing

<font size="3">
<div style="text-align: justify">
To analyze and visualize the dataset stored in 'data.xlsx', we took several steps to ensure its integrity and usability.
<br>
    
First, we imported the data from an Excel file into a DataFrame called df. Some entries were recorded as '--'. Because such placeholders would hinder subsequent analysis, we replaced them with NaN, the standard representation of missing data in pandas.

To allow numerical computation, we converted the 'CM' values to a numeric type. We then stored the 'PEG' and 'CM' columns in a new DataFrame named data. Some entries in this subset were missing (NaN), so we replaced them with zeros. Since negative values do not make sense in the context of our analysis, we replaced them with zeros as well. To compress the wide range of the 'CM' and 'PEG' values, we applied a natural-log transform to every positive entry and left zeros unchanged. Finally, we visualized the relationship between the 'PEG' and 'CM' columns in a scatter plot using the plotly.express library.
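
The cleaning steps described here can be sketched on a tiny made-up table. This is an illustration under assumptions, not the exact notebook code: the helper name `clean_and_log` is hypothetical, and it relies on `pd.to_numeric(..., errors="coerce")` to turn '--' into NaN instead of a separate replace step.

```python
import numpy as np
import pandas as pd

def clean_and_log(df, cols=("PEG", "CM")):
    """Coerce to numbers, zero out missing/negative values, then log-transform positives."""
    out = df[list(cols)].copy()                      # explicit copy, independent of df
    out = out.apply(pd.to_numeric, errors="coerce")  # '--' and other non-numbers become NaN
    out = out.fillna(0).clip(lower=0)                # missing and negative values become 0
    logged = np.log(out.where(out > 0))              # mask non-positive values before the log
    return logged.fillna(0.0)                        # masked entries stay at 0

# Illustrative rows only; the column names match our dataset
toy = pd.DataFrame({
    "PEG": [21.826249, "--", -28.691321, 0.5],
    "CM": ["2205962000", "--", 316682100.0, 1.0],
})
cleaned = clean_and_log(toy)
print(cleaned)
```

Note that a value between 0 and 1 becomes negative after the log, which is why the summary of the transformed 'PEG' column can show a negative minimum even though negative inputs were set to zero beforehand.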

In [4]:
# Import the raw data
df = pd.read_excel('data.xlsx')

# Replace all '--' in the DataFrame with NaN values
df.replace('--', np.nan, inplace = True)

# Convert data from string type to numeric type
df['CM'] = pd.to_numeric(df['CM'])

# Process missing values and negative values
# Take an explicit copy so the edits below do not trigger SettingWithCopyWarning
data = df[['PEG','CM']].copy()
# Replace all missing values (NaN) in data with 0
data = data.fillna(0)
print(data)
# Replace all values in the DataFrame that are less than 0 with 0
data[data<0] = 0

print(data.info())

# Log-transform the positive values in the 'CM' and 'PEG' columns; keep zeros as 0
data['CM'] = data['CM'].apply(lambda x: np.log(x) if x > 0 else 0)
data['PEG'] = data['PEG'].apply(lambda x: np.log(x) if x > 0 else 0)
print(data.describe())

# Using the plotly.express module from the plotly library to create a scatter plot of PEG and CM
fig = px.scatter(data, x='PEG', y='CM', title='PEG_CM')
fig.update_layout(width=800, height=600)
fig.show()


             PEG            CM
0       0.000000  5.294643e+09
1      21.826249  2.205962e+09
2       0.000000  1.649290e+09
3       0.000000  1.400748e+09
4       0.000000  7.170398e+08
...          ...           ...
1873   57.010907  2.058683e+09
1874   20.708502  3.840849e+08
1875  -28.691321  3.166821e+08
1876  131.021780  9.944043e+08
1877   10.944279  9.795974e+07

[1878 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1878 entries, 0 to 1877
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PEG     1878 non-null   float64
 1   CM      1878 non-null   float64
dtypes: float64(2)
memory usage: 29.5 KB
None
               PEG           CM
count  1878.000000  1878.000000
mean      2.588875    20.500456
std       2.226185     1.318989
min      -2.417261     0.000000
25%       0.000000    19.709739
50%       3.099983    20.560858
75%       4.455992    21.346124
max       9.722227    23.459718


### K-means Clustering

<font size="3">
<div style="text-align: justify">
In this code, we are performing cluster analysis on a dataset using the KMeans algorithm and then visualizing the results. 
<br><br>
    
**Initializing the Clustering Algorithm**

* We initialize the KMeans clustering algorithm with 9 clusters, fit it on the 'PEG' and 'CM' columns of the data DataFrame, and obtain the cluster assignment for each data point.

**Assigning Clusters to Data**

* For each row in the data DataFrame, we assign the cluster label it belongs to. This is stored in a new column named 'cluster'. We also extract the cluster centers and store them in the centers variable.
    
**Visualizing the Clusters**

* Using the plotly.express library, we create a scatter plot of the data points, where the x-axis represents the 'PEG' values, the y-axis represents the 'CM' values, and the color of each point signifies its cluster assignment.
The title of the plot is set to "KMeans Clustering", and the label for the color scheme is set as 'Cluster Group'.

In essence, this code segments the dataset into 9 clusters based on the 'PEG' and 'CM' features, and then it provides a visual representation of these clusters along with their centers.
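
K-means starts from randomly chosen centers, so the cluster numbers (and sometimes the clusters themselves) can change from one run to the next. A minimal sketch of a reproducible fit on synthetic data follows. The settings `random_state=0` and `n_init=10`, the helper name `fit_kmeans`, and the toy data are assumptions for illustration and are not the settings used in our notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic two-feature data: three well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])
toy = pd.DataFrame(points, columns=["PEG", "CM"])

def fit_kmeans(frame, k, seed=0):
    """Fit KMeans with a fixed seed; return per-row labels and the cluster centers."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(frame[["PEG", "CM"]])
    return labels, model.cluster_centers_

labels_a, centers_a = fit_kmeans(toy, k=3)
labels_b, centers_b = fit_kmeans(toy, k=3)
print(centers_a.shape)               # one (PEG, CM) center per cluster
print((labels_a == labels_b).all())  # same seed, same labels
```

Fixing the seed matters whenever cluster numbers are interpreted afterwards, because a mapping from cluster number to a meaning is only valid for the fit it was derived from.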

In [5]:
# Initializing clustering algorithm
kmeans = KMeans(n_clusters=9)
# Fit on the two features and get a cluster label for each row
clusters = kmeans.fit_predict(data[['PEG', 'CM']])

# Assigning Clusters to Data
data['cluster'] = clusters
centers = kmeans.cluster_centers_

# Visualizing the Clusters
fig = px.scatter(data, x='PEG', y='CM', color='cluster', 
                 title="KMeans Clustering",
                 labels={'cluster': 'Cluster Group'} )
# # Customize the color of each cluster
# cluster_colors = ['#FF5733', '#33FF57', '#5733FF', '#FFFF33', '#33FFFF', '#FF33FF', '#FF5733', '#33FF57', '#5733FF']

# # Create the scatter plot and set the color mapping
# fig = px.scatter(data, x='PEG', y='CM', color='cluster', color_discrete_map={i: color for i, color in enumerate(cluster_colors)},
#                  title="KMeans Clustering",
#                  labels={'cluster': 'Cluster Group'})

# Overlaying the cluster centers and displaying the visualization
df_centers = pd.DataFrame(centers, columns=['PEG', 'CM'])
for i, row in df_centers.iterrows():
    fig.add_scatter(x=[row['PEG']], y=[row['CM']], mode='markers',
                    marker=dict(symbol='x', size=10, color='black'),
                    showlegend=False,
                    name=f"Center {i + 1}")
    # cluster_label = f"Center {i + 1}"
    # fig.add_annotation(x=row['PEG'], y=row['CM'], text=cluster_label,
    #                    showarrow=True, arrowhead=2, arrowcolor='black',
    #                    font=dict(color='black', size=12))


fig.update_layout(width=1000, height=800)
fig.show()










<font size="3">
<div style="text-align: justify">
Next, we plot the companies colored by their Morningstar classification to check how well these classifications line up with the cluster boundaries we expected.
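
Alongside the plot, the overlap between clusters and Morningstar classifications can be counted directly. The sketch below uses `pd.crosstab` on a small made-up table; the rows are illustrative and the column names mirror the ones in our data DataFrame.

```python
import pandas as pd

# Made-up cluster assignments and classifications (illustrative only)
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "MorningStar": ["Small Value", "Small Value", "Small Blend",
                    "Small Growth", "Small Growth", "Small Blend"],
})

# Rows are clusters, columns are classifications, cells are company counts
table = pd.crosstab(toy["cluster"], toy["MorningStar"])
print(table)
```

A cluster that lines up well with one classification shows a single dominant cell in its row, while a row with counts spread across columns indicates a cluster that mixes several classifications.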

In [6]:
# Appending the 'MorningStar' column to data
data['MorningStar'] = df['MorningStar']

# Identifying and extracting unique values from the 'MorningStar' column in data
labels = data['MorningStar'].unique()

# Visualizing the companies grouped by MorningStar classification
fig = px.scatter(data, x='PEG', y='CM', color='MorningStar',
                 title='Scatter plot grouped by MorningStar Ratings',
                 labels={'PEG': 'PEG Values', 'CM': 'CM Values', 'MorningStar': 'MorningStar Ratings'},
                 symbol='MorningStar')  

fig.update_layout(width=1000, height=800)
fig.show()






We selected ten of our "favorite" stocks out of more than 1,700 stocks and compared the cluster-based predictions with Morningstar's classifications. In the run shown below, only 3 of the 10 predictions match (30% accuracy).
The low accuracy may be due to inadequate data preprocessing and to the fact that two variables, market capitalization (CM) and PEG, are not sufficient to locate a company's attributes in two dimensions.
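
K-means numbers its clusters arbitrarily, so a hand-written mapping from cluster number to label has to be rebuilt after every fit. One possible alternative, sketched here on made-up inputs, is to name each cluster after the most common Morningstar label among its members. `majority_label_map` is a hypothetical helper and was not used to produce our results.

```python
from collections import Counter, defaultdict

def majority_label_map(cluster_ids, labels):
    """Map each cluster id to the most frequent label among its members."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        votes[cluster_id][label] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}

# Made-up assignments and labels (illustrative only)
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["Small Value", "Small Value", "Small Blend",
          "Small Growth", "Small Growth", "Mid Blend"]
print(majority_label_map(cluster_ids, labels))
# {0: 'Small Value', 1: 'Small Growth', 2: 'Mid Blend'}
```

With this approach several clusters can end up with the same name, which is itself a useful signal that the two features do not separate those classifications.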

In [10]:

peg_list = data['PEG'][100:110]
cm_list = data['CM'][100:110]

new_data = pd.DataFrame({
    'PEG': peg_list,  
    'CM': cm_list   
})


number_to_string = {
    0: 'small value',
    1: 'mid growth',
    2: 'mid value',
    3: 'large value',
    4: '-',
    5: 'small growth',
    6: 'small blend',
    7: 'large growth',
    8: 'mid blend'
}

# Predict a cluster for each selected stock and map the cluster number to a label
alist = kmeans.predict(new_data[['PEG', 'CM']])
new_data['predicted_cluster'] = [number_to_string.get(num, 'Invalid Number') for num in alist]

print(new_data)
print(df['MorningStar'][100:110])

          PEG         CM predicted_cluster
100  0.000000  21.165422        mid growth
101  4.777315  20.366836         mid value
102  5.387413  22.411495      small growth
103  4.271350  22.021259      small growth
104  5.719843  20.524821      large growth
105  5.004133  21.096039      small growth
106  4.526813  21.796627      small growth
107  2.505191  22.057317       small blend
108  3.852020  22.654327      small growth
109  2.664884  21.513392       small blend
100     Small Blend
101    Small Growth
102     Small Blend
103    Small Growth
104     Small Value
105    Small Growth
106     Small Blend
107     Small Value
108     Small Blend
109     Small Blend
Name: MorningStar, dtype: object
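
The comparison printed above can be scored with a few lines of code. The sketch below copies the ten predicted and actual labels from that output and counts the matches, ignoring letter case. `match_rate` is a hypothetical helper name.

```python
def match_rate(predicted, actual):
    """Fraction of positions where the labels agree, ignoring case and surrounding spaces."""
    pairs = list(zip(predicted, actual))
    hits = sum(p.strip().lower() == a.strip().lower() for p, a in pairs)
    return hits / len(pairs)

# Rows 100 to 109, copied from the output above
predicted = ["mid growth", "mid value", "small growth", "small growth", "large growth",
             "small growth", "small growth", "small blend", "small growth", "small blend"]
actual = ["Small Blend", "Small Growth", "Small Blend", "Small Growth", "Small Value",
          "Small Growth", "Small Blend", "Small Value", "Small Blend", "Small Blend"]

print(match_rate(predicted, actual))  # 0.3 (rows 103, 105 and 109 match)
```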
