## This notebook was made to predict stock prices with traditional machine learning algorithms and deep learning

### To run this notebook, make sure you already have the dataset provided by Kaggle -> daily-historical-stock-prices-1970-2018

> Make sure the following files are available:
1. /kaggle/input/daily-historical-stock-prices-1970-2018/historical_stock_prices.csv
2. /kaggle/input/daily-historical-stock-prices-1970-2018/historical_stocks.csv

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting graphics

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Let's start by loading and looking at the data:

In [None]:
missing_values = ["n/a", "na", "--"]

stocks = pd.read_csv('../input/daily-historical-stock-prices-1970-2018/historical_stocks.csv',na_values = missing_values)

Let's start with a small preview of the dataframe we just read:

In [None]:
stocks.head()

In [None]:
stocks.columns

In [None]:
stocks.describe()

There are 5 columns in this dataset:

- 'ticker' corresponds to the name of the share
- 'exchange' corresponds to the stock exchange on which the share is listed
- 'name' refers to the company's name
- 'sector' refers to the sector in which the given company operates
- 'industry' specifies the type of services the company provides

We also know that this dataset contains missing values:

> The missing values are in the 'sector' and 'industry' columns
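
One way to confirm this is to count the nulls per column. The sketch below uses a small made-up frame (`toy`, with invented values, not the Kaggle file) so that it runs on its own; on the real data the same call would be applied to `stocks`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for historical_stocks.csv (values are invented for illustration)
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "exchange": ["NYSE", "NASDAQ", "NYSE"],
    "name": ["Alpha Inc.", "Beta Corp.", "Gamma Ltd."],
    "sector": ["FINANCE", np.nan, "TECHNOLOGY"],
    "industry": ["MAJOR BANKS", np.nan, "EDP SERVICES"],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```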


In [None]:
stocks.shape

In [None]:
stocks['ticker'].unique().size

> There are 6460 entries in the table, and the unique identifier is the share's name (the ticker). This means a company's name can appear more than once if, over the period covered, it changed the name of its shares. Note that companies that moved to a different exchange also changed the name of their shares.

> One example of this is:

In [None]:
stocks[stocks['name'] == "1347 PROPERTY INSURANCE HOLDINGS, INC."]

## Missing values treatment

#### The first step is to identify whether the companies changed their share name; if they did, the sector and industry may be present on another row

> Right now we have the following missing values:

1. ticker         0
2. exchange       0
3. name           0
4. sector      1440
5. industry    1440

> In a 6459 rows × 5 columns matrix


> We want all rows that contain null values, so we can obtain the names of the affected companies.

In [None]:
null_data = stocks[stocks.isnull().any(axis=1)]
null_data

> We see that every row that lacks a sector also lacks an industry, and vice versa.

### This loop checks for companies that changed their ticker name

> If any did, we check whether other rows of that same company contain the sector and industry information.

In [None]:
names = null_data['name'].unique()

for company in names:
    # All rows (tickers) listed under this company name
    mask = stocks['name'] == company
    # Rows of the same company where the sector is known
    known = stocks.loc[mask].dropna(subset=['sector'])

    if not known.empty:
        # Copy sector and industry from the first such row into the gaps
        sector = known.iloc[0]['sector']
        industry = known.iloc[0]['industry']
        stocks.loc[mask, 'sector'] = stocks.loc[mask, 'sector'].fillna(sector)
        stocks.loc[mask, 'industry'] = stocks.loc[mask, 'industry'].fillna(industry)
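
The same fill can also be done without an explicit loop. The following is a minimal sketch on invented data (the `toy` frame and its values are assumptions, not rows from the dataset): it takes the first non-null sector and industry per company name and uses them only where the value is missing.

```python
import numpy as np
import pandas as pd

# Toy data: "Alpha Inc." appears under two tickers, one of them without sector/industry
toy = pd.DataFrame({
    "ticker": ["ALP", "ALPX", "BET"],
    "name": ["Alpha Inc.", "Alpha Inc.", "Beta Corp."],
    "sector": ["FINANCE", np.nan, np.nan],
    "industry": ["MAJOR BANKS", np.nan, np.nan],
})

cols = ["sector", "industry"]
# First non-null value of each column within every company name, broadcast to all its rows
known = toy.groupby("name")[cols].transform("first")
# Only the gaps are filled; existing values are left untouched
toy[cols] = toy[cols].fillna(known)
print(toy)
```

Companies with no known sector on any row (here "Beta Corp.") stay null, which matches the behaviour described above.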


In [None]:
stocks.isnull().sum()

> After this operation we still have a 6459 rows × 5 columns matrix, but the number of missing values has changed:

1. ticker         0
2. exchange       0
3. name           0
4. sector      1018
5. industry    1018

### In this part, all remaining missing values are removed from the dataset, mainly because there is not enough information to fill them in, given the variety of existing sectors and industries.
> We now have a 5442 rows × 5 columns matrix

In [None]:
bad_tickers = stocks[stocks.isnull().any(axis=1)]


stocks = stocks.dropna(how='any',axis=0) 
stocks.isnull().sum()

> Now we only have 5442 tickers

In [None]:
stocks['ticker'].unique().size

# Dataset exploration

> Predominant sectors

> Predominant industries

> Types of stock exchanges on which we operate

In [None]:
stocks['name'].unique().size

In [None]:
stocks['exchange'].value_counts()

The number of shares listed on each exchange is fairly balanced, which is helpful for machine learning purposes.

In [None]:
stocks['exchange'].value_counts().plot(kind='bar', title='Types of exchanges')

In [None]:
stocks['sector'].value_counts()

In [None]:
# Remove the dummy row
stocks = stocks[stocks['sector'] != 'SECTOR']
stocks.shape

In [None]:
stocks['sector'].value_counts().plot(kind='barh', title='Sectors')

Throughout the dataset, the Finance sector dominates the sector column, along with Consumer Services and Health Care. Technology could also be included if we allow some leeway.

In [None]:
ax=stocks['sector'].value_counts().plot(kind='pie', title='Sectors', )
ax.set_ylabel('')

##  Industry

In [None]:
stocks['industry'].value_counts()

In [None]:
absolute_frequency_top10 = stocks['industry'].value_counts()[:10].copy()
absolute_frequency_top10 = absolute_frequency_top10.rename('')
absolute_frequency_top10.plot(kind='barh')

We can observe two major industries ruling the dataset by a considerable margin: Major Pharmaceuticals and Major Banks.

In [None]:
absolute_frequency_top10.plot(kind='pie')

## Let's dive into how often share names change.
### As we said earlier, a single company can have had several share names over time.

In [None]:
# Group by company and count, for each one, its tickers and its exchange entries
grouped = stocks.groupby(['name','sector','industry'])
dif_exchange_x_ticker_exchange = grouped.agg(ticker_exchange=('ticker','count'), exchange=('exchange','count'))
dif_exchange_x_ticker = grouped.agg(ticker_exchange=('ticker','count'))

# Number of ticker changes = number of tickers - 1
change_on = dif_exchange_x_ticker[dif_exchange_x_ticker['ticker_exchange'] >=2].sort_values(by='ticker_exchange', ascending=False).apply(lambda x : x-1)
change_off = dif_exchange_x_ticker[dif_exchange_x_ticker['ticker_exchange'] < 2].apply(lambda x : x-1)

In [None]:
dif_exchange_x_ticker_exchange

In [None]:
dif_exchange_x_ticker_exchange[dif_exchange_x_ticker_exchange['exchange'] == dif_exchange_x_ticker_exchange['ticker_exchange']].count()

The operation above shows that, for every company, the number of ticker names equals the number of exchange entries. Since no ticker appears twice in the dataset, this suggests that moving to or from an exchange requires renaming the share, so we can simply identify a share by its ticker, as we did previously. A ticker name might also be changed for marketing purposes while the share stays on the same exchange.

Note that we will not include the exchange as a feature because we consider that it doesn't add information **for now**: we have neither the dates of these changes nor any information on whether the shares still exist, so we cannot tell whether there is a shift in favor of any exchange. We therefore won't use the exchange in the rest of this topic.
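
A caveat on the comparison above: `count` tallies non-null rows, so the two counts match for any company whose rows all have an exchange. A stricter check, sketched here on invented data (the `toy` frame and the column names `n_tickers` and `n_exchanges` are assumptions), counts distinct values with `nunique` instead.

```python
import pandas as pd

# Toy data: Alpha changed exchange, Beta changed ticker but stayed on NYSE
toy = pd.DataFrame({
    "name": ["Alpha Inc.", "Alpha Inc.", "Beta Corp.", "Beta Corp."],
    "ticker": ["ALP", "ALPX", "BET", "BETA"],
    "exchange": ["NYSE", "NASDAQ", "NYSE", "NYSE"],
})

# Distinct tickers and distinct exchanges per company
per_company = toy.groupby("name").agg(
    n_tickers=("ticker", "nunique"),
    n_exchanges=("exchange", "nunique"),
)

# More tickers than exchanges: at least one ticker change happened on the same exchange
same_exchange_renames = per_company[per_company["n_tickers"] > per_company["n_exchanges"]]
print(per_company)
```

If `same_exchange_renames` is empty on the real data, that would support the idea that ticker changes go together with exchange changes; if not, it points to the marketing-style renames mentioned above.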

In [None]:
change_off

In [None]:
change_on

We produced two distinct dataframes, grouped by company name, sector and industry, that hold the number of ticker changes:
> The first one, 'change_off', covers the companies that did not make any changes

> The second one, 'change_on', covers the companies that changed their ticker name in the past


In [None]:
ax=change_on.groupby('sector').mean().sort_values(by='ticker_exchange',ascending=False).plot(kind='barh',y='ticker_exchange',legend=False, title ='Sectors')
#ax.set_ylabel('')

Companies in the Finance sector show the highest mean number of ticker changes, followed closely by Transportation and Consumer Services.

In [None]:
ax=change_on.groupby('industry').mean().sort_values(by='ticker_exchange',ascending=False)[:10].plot(kind='barh',y='ticker_exchange',legend=False, title ='Top 10 Industries')
#ax.set_ylabel('')

Industry-wise, Finance ranks high again, level with Farming/Seeds/Milling. Investment Bankers/Brokers/Service follows right behind.

This concludes the shallow analysis of the first component of the Stock Prices dataset.

# Let's now explore the stock prices

In [None]:
stock_prices = pd.read_csv('../input/daily-historical-stock-prices-1970-2018/historical_stock_prices.csv')
stock_prices

A short explanation of the dataset:

- 'ticker' corresponds to the name of the share
- 'open' is the opening price of the share on a specific day
- 'close' is the final price of the share at the end of the day
- 'adj-close' is a tricky column: it holds the adjusted price of the share, which is normally different from the close price
 > One example is when a stock split occurs. 
A stock split is a common way for companies to increase the number of shares by dividing the price by a factor x. For example, with x = 2, a share worth 10€ becomes two shares worth 5€ each, and those 2 shares together represent the same stake as 1 share did before. If the company has 100 shares, buying 1 share gives you 1% of the company; after the split there are 200 shares, so you need 2 shares to hold that same 1%.

 > A stock's price is typically affected by supply and demand of market participants. However, some corporate actions, such as stock splits, dividends / distributions and rights offerings can affect a stock's price and adjustments are needed to arrive at a technically accurate reflection of the true value of that stock.

- 'low' is the lowest price paid for the share on that day
- 'high' is the highest price paid for the share on that day
- 'volume' is the number of shares traded on that day
- 'date' represents the date (year-month-day)
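
The stock-split adjustment described for 'adj-close' can be illustrated with a small sketch. The prices, dates and the 2-for-1 ratio below are invented, and only the split is handled; real adjusted prices also account for dividends and other corporate actions.

```python
import pandas as pd

# Invented closing prices around a 2-for-1 split that takes effect on the third day
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"]),
    "close": [10.0, 10.4, 5.3, 5.5],
})
split_date = pd.Timestamp("2020-01-03")
split_ratio = 2  # each old share becomes two new shares

# Divide every pre-split close by the split ratio so the series is comparable
# with post-split prices; prices on or after the split stay unchanged
factor = prices["date"].lt(split_date).map({True: 1 / split_ratio, False: 1.0})
prices["adj_close"] = prices["close"] * factor
print(prices)
```

Without the adjustment, the raw close would show an apparent drop of about 50% on the split day even though the value held by shareholders did not change.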

As far as we can tell from the table above, this dataset doesn't contain any missing values.

In [None]:
stock_prices.describe()

In [None]:
stock_prices["ticker"].unique().size

> As mentioned earlier, the ticker is the key; in this case, a ticker represents a company.

> This dataset has 5685 different tickers, while the previous dataset had 5441, so we have to eliminate some tickers here.

> This operation takes quite long, maybe 5 minutes.
> Meanwhile, go get a coffee.

## Do not run this cell for now

In [None]:
'''
pd.options.mode.chained_assignment = None
tickers = bad_tickers['ticker'].unique()

ind = []

for index, row in stock_prices.iterrows():
    if(row['ticker'] in tickers):
        ind.append(index)
        
ind = np.asarray(ind)
stock_prices = stock_prices.drop(ind)  # drop returns a new frame, so keep the result
    

stock_prices["ticker"].unique().size
'''
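
Iterating row by row over millions of prices is what makes the cell above slow. A vectorized alternative, sketched here on invented data (`toy_prices` and `dropped_tickers` are stand-ins for `stock_prices` and the tickers in `bad_tickers`), filters with `isin` in a single pass.

```python
import pandas as pd

# Toy stand-ins for stock_prices and for the tickers dropped earlier
toy_prices = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA", "CCC"],
    "adj_close": [1.0, 2.0, 1.1, 3.0],
})
dropped_tickers = ["BBB"]

# Boolean mask: True for rows whose ticker should be kept
keep = ~toy_prices["ticker"].isin(dropped_tickers)
filtered = toy_prices[keep]
print(filtered["ticker"].unique().size)
```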

# Correlation analysis

In [None]:
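# Correlation is only defined for numeric columns, so 'ticker' and 'date' are left out
stock_prices.select_dtypes(include='number').corr()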

>As we can expect, adj_close isn't strongly correlated with any feature.

>The goal of this work is to predict stock prices. Even today this task is almost impossible because of the huge set of factors that can make the adjusted close vary. This dataset contains the most common features, which are easy to obtain; as one can imagine, it is very hard to have all the features related to changes in a company's stock price due to the lack of information, let alone for 5685 companies.

>We will do our best to predict the adj_close price using these features.

## Let's check the traded volumes and the mean, variance and standard deviation of the adj_close price

### The adjusted close price is the key for a good trader. We want to build several machine learning algorithms that can accurately predict this column.

In [None]:
# Per-ticker aggregates: total traded volume, number of rows, and the
# mean, variance and standard deviation of the adjusted close price
stocks_adj_close = stock_prices.groupby('ticker').agg(
    Volume_of_trades=('volume', 'sum'),
    Count_Ticker=('volume', 'count'),
    Mean_adj_close=('adj_close', 'mean'),
    Var_adj_close=('adj_close', 'var'),
    Std_adj_close=('adj_close', 'std'),
)

In [None]:
stocks_adj_close

## As the table shows, some companies appear more often than others.
### We will try to predict the adjusted close price for the companies that have the most rows in the dataset; in other words, we choose the companies with the most data, for more accurate predictions

In [None]:
stocks_adj_close.describe()

In [None]:
stocks_adj_close_top_10 = stocks_adj_close['Count_Ticker'].sort_values(ascending = False)[:10].copy()
stocks_adj_close_top_10 = stocks_adj_close_top_10.rename('')
stocks_adj_close_top_10

## We want to predict the stock prices of a specific company.
### So the first step is to build a dataframe with a single ticker.

# Let's start by building an algorithm to predict the adjusted close price of HPQ

In [None]:
def companny_stocks(ticker):
    return stock_prices[stock_prices["ticker"] == ticker]

df = companny_stocks("HPQ")
df = df.drop(["ticker"],axis=1)
df[60:80]

> Data corresponding to the company with the HPQ ticker.
> All rows are ordered by date. There are some gaps, for example from 1970-04-24 to 1970-04-27 in the last two rows of this preview, but these gaps come from weekends, when the market is closed, and sometimes from holidays.
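
A quick way to check that the gaps really are weekends is to compare consecutive dates. This is a minimal sketch on a few hand-written dates around the gap mentioned above (the series is illustrative, not read from the dataset).

```python
import pandas as pd

# Illustrative trading dates around the gap discussed above
dates = pd.Series(pd.to_datetime(["1970-04-23", "1970-04-24", "1970-04-27", "1970-04-28"]))

# Days elapsed since the previous row
gap_days = dates.diff().dt.days
# True when the previous row is a Friday (dayofweek == 4), i.e. the gap spans a weekend
prev_is_friday = dates.shift(1).dt.dayofweek == 4

report = pd.DataFrame({"date": dates, "gap_days": gap_days, "after_friday": prev_is_friday})
print(report[report["gap_days"] > 1])
```

Gaps longer than one day that do not follow a Friday would be candidates for holidays or missing data.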

In [None]:
df = df.drop(["date"],axis=1)

### This company is listed on the NYSE exchange and its sector is Technology

In [None]:
stocks[stocks['ticker'] == 'HPQ']

### Some thoughts:

1. Write a short explanation of the columns -> Done

2. Check which top 10 tickers have the most volume/trades on the market and draw some conclusions about their sector/industry, so that the first part of the dataset exploration is consistent

3. Explore the Volume and Bets Column

4. Separate the stocks by ticker

5. Check if the date is consistent


# Machine Learning algorithms

1. Linear Regression
2. Support Vector Regression (SVR)
3. Decisions Tree / Random Forest
4. Recurrent Neural Networks (RNN) / LSTM



> First, let's prepare the small dataset corresponding to the company HPQ:

> The adj_close price column is the label column and the rest of the features are independent variables

> Let's normalize all the variables, and then build x_train and y_train


In [None]:
def show_distribution(df):
    plt.figure(figsize=(70, 3))
    i = 0
    
    for feature in df:
        plt.subplot(1, 20, i+1)
        plt.plot(df[feature])
        plt.xlabel(feature)
        i += 1
    plt.show()

show_distribution(df)

>As we can see, all these series look very similar, but none of them is close to a normal distribution, so we will normalize the data using the min-max normalizer.

# Normalizing and splitting the data

In [None]:
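from sklearn.preprocessing import MinMaxScaler

# Scale every column to the [0, 1] range
scaler = MinMaxScaler()
scaler.fit(df)

df_normalized = scaler.transform(df)

# Column order: open, close, adj_close, low, high, volume
x = df_normalized[:,[0,1,3,4,5]]  # features
y = df_normalized[:,[2]]          # label: adj_close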
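
For reference, min-max scaling maps each column to the [0, 1] range with x' = (x - min) / (max - min). The sketch below reproduces that formula with NumPy on invented numbers; it is meant to show the arithmetic, not to replace `MinMaxScaler`.

```python
import numpy as np

# Invented values for two features (columns) on very different scales
data = np.array([[10.0, 1000.0],
                 [12.0, 4000.0],
                 [15.0, 2500.0]])

col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Min-max scaling: x' = (x - min) / (max - min), applied column by column
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, a low-priced column and a large volume column share the same range, so neither dominates a distance-based or gradient-based model purely because of its units.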

In [None]:
plt.plot(df_normalized[:,5])
plt.xlabel("volume")
plt.show()

# Splitting the data

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
      x, y, test_size = 0.2, random_state=2
    )
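
Note that `train_test_split` shuffles the rows by default, so test days end up interleaved with training days. For time-ordered data, a chronological split is a common alternative; the following is a hedged sketch on stand-in arrays (`x_demo` and `y_demo` are invented), keeping the oldest 80% for training. Passing `shuffle=False` to `train_test_split` gives the same kind of split.

```python
import numpy as np

# Stand-in arrays: 10 time-ordered samples with 5 features each
x_demo = np.arange(50, dtype=float).reshape(10, 5)
y_demo = np.arange(10, dtype=float).reshape(10, 1)

# First 80% of the rows for training, last 20% for testing
split_at = int(len(x_demo) * 0.8)
x_train_demo, x_test_demo = x_demo[:split_at], x_demo[split_at:]
y_train_demo, y_test_demo = y_demo[:split_at], y_demo[split_at:]
print(x_train_demo.shape, x_test_demo.shape)
```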

In [None]:
y_train.shape


# Linear Regression

In [None]:
from sklearn import linear_model

# The features are already min-max scaled, so no extra normalization is requested here
regr_model = linear_model.LinearRegression()
regr_model = regr_model.fit(x_train, y_train)

print("Coefficient:" ,regr_model.coef_)

y_pred = regr_model.predict(x_test) 

print("Model: " , regr_model)
print("Predicted values: " , y_pred)
print("Actual values: " , y_test)

print("Score (R^2 on the training set): " , regr_model.score(x_train,y_train))



> Compare the actual output values for **x_test** with the predicted values

In [None]:
result = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

result

>The comparison can be shown as a bar chart using the script below

>**Note**: As the number of records is huge, we plot only 30 records for readability.

In [None]:
result1 = result.head(30)
result1.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
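
Because the label was min-max scaled, these errors are expressed in the [0, 1] range rather than in price units. A minimal sketch of the conversion follows; the minimum, maximum and predictions are invented, and on the real data the range of adj_close would come from the fitted scaler (for example `scaler.data_range_[2]`, since adj_close is the third column).

```python
import numpy as np

# Hypothetical min and max of the original adj_close column (not taken from the dataset)
adj_min, adj_max = 2.0, 52.0
price_range = adj_max - adj_min

# Invented scaled targets and predictions in the [0, 1] range
y_true_scaled = np.array([0.10, 0.50, 0.90])
y_pred_scaled = np.array([0.12, 0.47, 0.91])

mae_scaled = np.mean(np.abs(y_true_scaled - y_pred_scaled))
# Undoing min-max scaling multiplies absolute errors by (max - min)
mae_price = mae_scaled * price_range
print(mae_scaled, mae_price)
```

The same factor applies to the RMSE, while the MSE scales with the square of the range.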
