# stock_list.csv file

Before looking at the main file, **stock_prices**, it can be interesting to look at **stock_list**. We want to get familiar with the population of stocks we have to deal with before diving into their price time series.

In [None]:
import pandas as pd
import numpy as np
df=pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/stock_list.csv')
df.dtypes

In [None]:
print(df.shape)
print(df['SecuritiesCode'].nunique())

There are 4417 distinct securities, and each line of stock_list describes exactly one security. There are no duplicated and no missing security codes.
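Comparing the row count with `nunique()` shows there are no duplicates, but missing codes need a separate check. A minimal sketch of both checks follows; the helper name `check_unique_key` and the toy codes are made up for illustration and do not come from the competition files.

```python
import pandas as pd

def check_unique_key(frame: pd.DataFrame, key: str) -> dict:
    """Summarise row count, distinct values, duplicates and missing values of a key column."""
    return {
        "rows": len(frame),
        "distinct": int(frame[key].nunique()),
        "duplicated": int(frame[key].duplicated().sum()),
        "missing": int(frame[key].isna().sum()),
    }

# Toy frame standing in for stock_list (hypothetical values)
toy = pd.DataFrame({"SecuritiesCode": [1001, 1002, 1003]})
summary = check_unique_key(toy, "SecuritiesCode")
print(summary)
```

On the real data the same call would be `check_unique_key(df, 'SecuritiesCode')`; the column is a clean key when `rows` equals `distinct` and both `duplicated` and `missing` are 0.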


# Market Capitalization

First we look at the distribution of market capitalization. Because the values span several orders of magnitude, we plot the distribution of **log10** of the market capitalization.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(np.log10(df['MarketCapitalization']))
plt.show()

The integer part of the log10 value gives the number of digits after the first digit.
Most market capitalizations lie between 100 million yen (8 zeros, log10 = 8) and 10 trillion yen (13 zeros, log10 = 13).
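To make the reading of the log10 axis concrete, here is a small sketch of the relation between log10 and the number of digits. The helper `digits_after_first` is a hypothetical name introduced only for this illustration.

```python
import math

def digits_after_first(value: float) -> int:
    """Number of digits after the leading digit of a positive number."""
    return math.floor(math.log10(value))

# 100 million yen, 2.5 billion yen and 10 trillion yen
for amount in (100_000_000, 2_500_000_000, 10_000_000_000_000):
    print(amount, digits_after_first(amount))
```

A value of 9.4 on the x-axis of the plot therefore corresponds to a capitalization of a few billion yen.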

# Sectors

There are 4 columns dealing with the sectors: 33SectorCode, 17SectorCode, 33SectorName, 17SectorName. Let's try to understand the difference between 17 and 33.

In [None]:
# Set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize']=(15,15)
sns.countplot(data=df,y='33SectorName',order = df['33SectorName'].value_counts().index)
plt.show()

In [None]:
print('There are {} sector33 and {} sector17'.format(df['33SectorCode'].nunique(),df['17SectorCode'].nunique()))

In [None]:
print(sorted(df['17SectorCode'].unique()))

In [None]:
print(sorted(df['33SectorCode'].unique()))

Sector33 has 33 codes plus a dummy value "-".
Sector17 has 17 codes plus a dummy value "-".

What is the relation between Sector33 and Sector17? Are they connected or independent?

In [None]:
df_sector=df[['17SectorCode','33SectorCode']].drop_duplicates()
print(df_sector.shape)
print(df_sector.sort_values(by='17SectorCode'))


We can see that each value of **33SectorCode** always appears with the same value of **17SectorCode**, so **33SectorCode** is a **subclassification** of **17SectorCode**. The two classifications are not independent: 33SectorCode is the **finer** one and 17SectorCode is the **coarser** one.

This is consistent with the link provided https://www.jpx.co.jp/english/markets/indices/line-up/files/e_fac_13_sector.pdf
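Reading the printed pairs by eye works for 34 rows, but the nesting can also be tested programmatically: a classification is a subclassification of another when every fine code maps to exactly one coarse code. The sketch below assumes this definition; `is_subclassification` is a hypothetical helper and the toy codes are made up.

```python
import pandas as pd

def is_subclassification(frame: pd.DataFrame, fine: str, coarse: str) -> bool:
    """True if every value of `fine` appears with exactly one value of `coarse`."""
    parents_per_child = frame.groupby(fine)[coarse].nunique()
    return bool((parents_per_child == 1).all())

# Toy mapping (hypothetical codes): a1 and a2 both belong to A, b1 belongs to B
toy = pd.DataFrame({
    "33SectorCode": ["a1", "a2", "b1", "a1"],
    "17SectorCode": ["A", "A", "B", "A"],
})
print(is_subclassification(toy, "33SectorCode", "17SectorCode"))
```

On the real data the call would be `is_subclassification(df, '33SectorCode', '17SectorCode')`.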

In [None]:

fig,ax = plt.subplots(nrows=2,ncols=1,figsize=(15,15))


sns.countplot(data=df,y='17SectorCode',ax=ax[0],order = df['17SectorCode'].value_counts().index)
sns.countplot(data=df,y='33SectorCode',ax=ax[1],order = df['33SectorCode'].value_counts().index)

plt.tight_layout()
plt.show()

The securities are not evenly distributed across sectors: some sectors contain far more securities than others.

# Section/Products

Let's look at the column Section/Products now.

In [None]:
sns.countplot(data=df,y='Section/Products',order = df['Section/Products'].value_counts().index)
plt.show()

In [None]:
print('There are {} Section/Products'.format(df['Section/Products'].nunique()))

The Section/Products are not evenly distributed. **First Section (Domestic)** has the most securities.

We would now like to investigate the dummy sector **'-'**.

In [None]:

df_dummy_sector = df.loc[df['33SectorName']=='-']
print('{} securities belong to dummy sector.'.format(df_dummy_sector['SecuritiesCode'].nunique()))

In [None]:
sns.countplot(data=df_dummy_sector,y='Section/Products',order = df_dummy_sector['Section/Products'].value_counts().index)
plt.show()

We can see that ETFs/ETNs make up a large share of the dummy sector.
Conversely, let's check which sectors the ETFs/ETNs belong to:

In [None]:
df_ETFs_ETNs = df.loc[df['Section/Products']=='ETFs/ ETNs']
print('{} securities are ETFs/ETNs.'.format(df_ETFs_ETNs['SecuritiesCode'].nunique()))

In [None]:
sns.countplot(data=df_ETFs_ETNs,y='33SectorName',order = df_ETFs_ETNs['33SectorName'].value_counts().index)
plt.show()

All the ETFs/ETNs are in the dummy sector.
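Both directions of this relationship can be read from a single cross-tabulation instead of two filtered plots. The following is a minimal sketch on toy data: the rows are made up, and "Other product" is a hypothetical label, not a value from the file.

```python
import pandas as pd

# Toy frame standing in for stock_list (hypothetical rows)
toy = pd.DataFrame({
    "Section/Products": ["ETFs/ ETNs", "ETFs/ ETNs",
                         "First Section (Domestic)", "Other product"],
    "33SectorName": ["-", "-", "Foods", "-"],
})

# Rows: Section/Products, columns: sector name, cells: number of securities
table = pd.crosstab(toy["Section/Products"], toy["33SectorName"])
print(table)

# True when every ETF/ETN row carries the dummy sector "-"
etf_rows = toy.loc[toy["Section/Products"] == "ETFs/ ETNs", "33SectorName"]
all_etfs_dummy = bool((etf_rows == "-").all())
print(all_etfs_dummy)
```

The column "-" of the table shows how the dummy sector splits across products, while the "ETFs/ ETNs" row shows which sectors the ETFs/ETNs fall into.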

# Are all the securities from stock_list included in the competition?

As we have seen, there are **4417 securities** in the stock_list file. Are they all included in the competition?

In [None]:
df_prices=pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
print(df_prices.shape)
print(df_prices.dtypes)

In [None]:
print('There are {} securities in stock_prices'.format(df_prices['SecuritiesCode'].nunique()))

Out of 4417 securities in stock_list, only 2000 are included in the competition.

In [None]:
print('Security codes in stock_prices. Lowest: {}, Highest: {}, Number: {}'.
      format(df_prices['SecuritiesCode'].min(),
             df_prices['SecuritiesCode'].max(),
             df_prices['SecuritiesCode'].nunique()))

print('Security codes in stock_list. Lowest: {}, Highest: {}, Number: {}'.
      format(df['SecuritiesCode'].min(),
             df['SecuritiesCode'].max(),
             df['SecuritiesCode'].nunique()))
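Minimum, maximum and count do not tell us which codes are shared between the two files. A set comparison answers that directly. The sketch below uses toy codes; `compare_codes` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def compare_codes(list_codes, price_codes) -> dict:
    """Compare two collections of security codes using set operations."""
    list_set, price_set = set(list_codes), set(price_codes)
    return {
        "only_in_list": sorted(list_set - price_set),
        "only_in_prices": sorted(price_set - list_set),
        "in_both": len(list_set & price_set),
    }

# Toy series standing in for df['SecuritiesCode'] and df_prices['SecuritiesCode']
toy_list = pd.Series([1001, 1002, 1003, 1004])
toy_prices = pd.Series([1002, 1002, 1004])  # prices repeat a code once per date
result = compare_codes(toy_list, toy_prices)
print(result)
```

On the real data, `compare_codes(df['SecuritiesCode'], df_prices['SecuritiesCode'])` would list the codes that appear only in stock_list, and an empty `only_in_prices` would confirm that every priced security is described in stock_list.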
