# Stock Market Analysis
In this report we analyze stock data and derive insights from it.
First, we check the data for null values and outliers and clean it. Then we analyze the latest data to see how the stock is performing, using measures such as stock return and moving averages.
So, let's begin by importing the necessary packages.

In [None]:
import matplotlib.pyplot as plt  #for plotting the graphs
import numpy as np               #for operating on arrays if needed
import datetime                  #for taking date object
import pandas as pd              #for holding the stock data in a DataFrame
import file_import               #for importing the stock data file
import call_api
from stocker import Stocker

After importing the necessary packages, let us load the data of a particular stock into a pandas DataFrame.

In [None]:
data, lis, file_name = file_import.operate_file() #Here data holds the stock data
up_data = call_api.data_update(data, file_name, lis[0])
data = pd.concat([data, up_data], sort = True)
stock_name = lis[0]
start_date = lis[1]
end_date = lis[2]
data = data.drop(['1. open','2. high','3. low','4. close','5. volume'], axis = 1)
print("Let us see the first 10 rows of the %s stock\n\n"%(stock_name),data.head(10)) #head(10) limits the output to the first 10 rows


Let's get the date interval over which the user wants to analyze the data.

In [None]:
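data.drop_duplicates(keep = False, inplace = True) #keep = False drops every copy of a duplicated row
while True:
    y,m,d = start_date.split('-',3)
    temp_date = datetime.date(int(y),int(m),int(d))

    df = data.loc[start_date : end_date, : ].copy() #Slicing the data from start date to end date (copy, so later column assignments are safe)
    if len(df.index) > 0:
        break
    print("Date Interval specified is Invalid \nPlease, Re-Enter the date in Valid format YYYY-MM-DD : \n")
    #Ask for new dates, otherwise the loop would repeat forever with the same interval
    start_date = input("Start date (YYYY-MM-DD): ")
    end_date = input("End date (YYYY-MM-DD): ")

df.index = pd.to_datetime(df.index) #Converting index to datetime object so that daily, monthly or yearly data could be taken out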
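
The loop above assumes that the dates arrive as well-formed `YYYY-MM-DD` strings. A minimal sketch of checking such a string before slicing, using only the standard library; the helper name `parse_iso_date` is hypothetical and not part of `file_import`:

```python
import datetime

def parse_iso_date(text):
    """Return a datetime.date for a YYYY-MM-DD string, or None if it is invalid."""
    try:
        return datetime.datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

valid = parse_iso_date("2019-03-15")    # a real calendar date
invalid = parse_iso_date("2019-02-30")  # February has no 30th day, so this is None
```

Returning `None` instead of raising keeps the calling loop simple: it can re-prompt whenever the result is `None`.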

<h2>Describing Selected Data</h2>
After selecting the stock data, let us describe it. <code>describe()</code> is a built-in method of pandas DataFrames that summarizes the data. It returns a table containing the mean, standard deviation, minimum value, etc. of each numeric column. The values it gives are:
<ul>
   <li>count = Number of non-null elements in that column</li>
   <li>mean = Arithmetic mean of the elements in that column</li>
   <li>std = Standard deviation</li>
   <li>min = Minimum value in that column</li>
   <li>25% = 25th percentile (first quartile): the value below which about a quarter of the values fall</li>
   <li>50% = 50th percentile (median): the value below which about half of the values fall</li>
   <li>75% = 75th percentile (third quartile): the value below which about three quarters of the values fall</li>
   <li>max = Maximum value in that column</li>
</ul>

In [None]:
df.describe()
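
To make the percentile rows concrete, here is a small sketch on a toy Series. The values are made up for illustration; with five evenly spaced numbers the quartiles can be checked by hand.

```python
import pandas as pd

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
stats = toy.describe()

# The 50% row is the median; 25% and 75% are the first and third quartiles.
print(stats[["25%", "50%", "75%"]])
```

Here the quartiles are 2, 3 and 4. When a percentile falls between two data points, pandas interpolates linearly by default.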

<h2>Simple Line Graph</h2>
Graphs and charts are among the best ways of representing data, because they show at a glance how the data is moving. We start with a simple line graph of the Open, High, Low and Close values of the stock. A simple line graph doesn't provide much detail, but it is a good first step in visualizing the stock data, so let's plot it.

In [None]:
%pylab inline
%matplotlib inline

pylab.rcParams['figure.figsize'] = (15, 9)
pt = df[["Open","Close","High","Low"]].plot(grid = True)
pt.set_xlabel("Year")
pt.set_ylabel("Price")
plt.savefig('graph.png') #Save the figure before showing it
plt.show()

<h2>Discovering the Relation between Total Traded Quantity vs Close Price</h2>
Usually, the traded quantity increases when the stock price rises or falls sharply on a given day. Traded quantity is therefore an important input for a prediction model, so it is worth taking some time to identify how the two are related in our data.

In [None]:
from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap

(ggplot(df, aes('Close', 'Volume', colour = 'Close'))
 + geom_point()
 + stat_smooth(method='loess')
 )
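
One rough way to quantify this relation is to correlate the size of each day's price move with that day's volume. The sketch below uses made-up numbers and assumes columns named `Close` and `Volume`, as in `df`. A correlation close to 1 would support the claim above; real data will usually be much noisier.

```python
import pandas as pd

toy = pd.DataFrame({
    "Close":  [100.0, 101.0, 100.5, 106.0, 105.5, 98.0],
    "Volume": [1000,  1200,  1100,  3000,  1150,  4000],
})

# Absolute daily percentage change in the closing price (the first row is NaN).
abs_move = toy["Close"].pct_change().abs()

# Pearson correlation between the size of the move and the traded volume.
# Series.corr ignores the row where abs_move is NaN.
corr = abs_move.corr(toy["Volume"])
print(round(corr, 3))
```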

<h2>Stock's Return</h2>
The line graph shows absolute prices, but a better approach is to plot the information we actually want: the stock's returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply.

One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot:

$\text{return}_{t,0} = \frac{\text{price}_t}{\text{price}_0}$

This requires transforming the closing prices in `df`, which we do next.

In [None]:
stock_return = df["Close"] / df["Close"].iloc[0] #Each closing price relative to the first closing price of the period
stock_return.head()

In [None]:
stock_return.plot(grid = True).axhline(y = 1, color = "black", lw = 2)
plt.show()
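
As a worked example of the return formula, here it is applied to four made-up closing prices. A value of 1.1 means the stock is worth 10% more than at the start of the period, and 0.9 means 10% less.

```python
import pandas as pd

close = pd.Series(
    [50.0, 55.0, 45.0, 60.0],
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]),
)

# Divide every price by the first price of the period.
toy_return = close / close.iloc[0]
print(toy_return.tolist())  # [1.0, 1.1, 0.9, 1.2]
```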

<h2>Logarithmic Stock Change</h2>
The logarithmic change of a stock describes its day-to-day movement and is a common basis for measuring volatility. We define the change as<br>
$\text{change}_t = \log(\text{price}_{t}) - \log(\text{price}_{t - 1})$<br>
Here, $\log$ is the natural log. The definition does not depend strongly on whether we use the expression above or<br>
$\log(\text{price}_{t+1}) - \log(\text{price}_{t})$<br>
since the second form only shifts the series by one day.<br>
The advantage of using log differences is that the difference can be interpreted as approximately the percentage change in a stock, but does not depend on the denominator of a fraction.

In [None]:
stock_change = np.log(df["Close"]).diff() #Difference of consecutive log closing prices
stock_change.head()

In [None]:
stock_change.plot(grid = True).axhline(y = 0, color = "black", lw = 2)
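
A small worked example of log changes, using made-up prices and only the standard library. It illustrates two properties: for small moves the log change is close to the ordinary percentage change, and log changes add up across days.

```python
import math

prices = [100.0, 110.0, 99.0, 105.0]

# Log change between consecutive prices: log(price_t) - log(price_{t-1}).
log_changes = [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

# Simple percentage changes for comparison.
pct_changes = [(b - a) / a for a, b in zip(prices, prices[1:])]

# Log changes add up: their sum equals the log change over the whole period.
total = sum(log_changes)
whole_period = math.log(prices[-1] / prices[0])

for log_c, pct_c in zip(log_changes, pct_changes):
    print(round(log_c, 4), round(pct_c, 4))
```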

<h2>Japanese Candlestick Plot</h2>

A line chart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a Japanese candlestick plot, so named because it was first created by 18th-century Japanese rice traders. Such a chart can be created with matplotlib, though it requires considerable effort.

In [None]:
from matplotlib.dates import DateFormatter, WeekdayLocator,\
    DayLocator, MONDAY, date2num  # date2num converts the dates passed to candlestick_ohlc below
from mpl_finance import candlestick_ohlc
 
def pandas_candlestick_ohlc(dat, stick = "day", otherseries = None):
    #dat: pandas DataFrame object with datetime64 index, and float columns "Open", "High", "Low", and "Close", likely created via DataReader from "yahoo"
    #stick: A string or number indicating the period of time covered by a single candlestick. Valid string inputs include "day", "week", "month", and "year", ("day" default), and any numeric input indicates the number of trading days included in a period
    #otherseries: An iterable that will be coerced into a list, containing the columns of dat that hold other series to be plotted as lines
    #This will show a Japanese candlestick plot for stock data stored in dat, also plotting other series if passed.
    mondays = WeekdayLocator(MONDAY)        # major ticks on the mondays
    alldays = DayLocator()              # minor ticks on the days
    dayFormatter = DateFormatter('%d')      # e.g., 12
 
    # Create a new DataFrame which includes OHLC data for each period specified by stick input
    transdat = dat.loc[:,["Open", "High", "Low", "Close"]]
    if (type(stick) == str):
        if stick == "day":
            plotdat = transdat
            stick = 1 # Used for plotting
        elif stick in ["week", "month", "year"]:
            if stick == "week":
                transdat["week"] = pd.to_datetime(transdat.index).map(lambda x: x.isocalendar()[1]) # Identify weeks
            elif stick == "month":
                transdat["month"] = pd.to_datetime(transdat.index).map(lambda x: x.month) # Identify months
            transdat["year"] = pd.to_datetime(transdat.index).map(lambda x: x.isocalendar()[0]) # Identify years
            group_keys = ["year"] if stick == "year" else ["year", stick] # Fixed key order keeps the groups chronological
            grouped = transdat.groupby(group_keys) # Group by year and other appropriate variable
            rows = [] # One single-row DataFrame per period
            for name, group in grouped:
                rows.append(pd.DataFrame({"Open": group.iloc[0,0],
                                            "High": max(group.High),
                                            "Low": min(group.Low),
                                            "Close": group.iloc[-1,3]},
                                           index = [group.index[0]]))
            plotdat = pd.concat(rows) # DataFrame.append was removed in pandas 2.0
            if stick == "week": stick = 5
            elif stick == "month": stick = 30
            elif stick == "year": stick = 365
 
    elif (type(stick) == int and stick >= 1):
        transdat["stick"] = [np.floor(i / stick) for i in range(len(transdat.index))]
        grouped = transdat.groupby("stick")
        rows = [] # One single-row DataFrame per period
        for name, group in grouped:
            rows.append(pd.DataFrame({"Open": group.iloc[0,0],
                                        "High": max(group.High),
                                        "Low": min(group.Low),
                                        "Close": group.iloc[-1,3]},
                                       index = [group.index[0]]))
        plotdat = pd.concat(rows) # DataFrame.append was removed in pandas 2.0
 
    else:
        raise ValueError('Valid inputs to argument "stick" include the strings "day", "week", "month", "year", or a positive integer')
 
 
    # Set plot parameters, including the axis object ax used for plotting
    fig, ax = plt.subplots()
    fig.subplots_adjust(bottom=0.2)
    if (plotdat.index[-1] - plotdat.index[0] < pd.Timedelta('730 days')):
        weekFormatter = DateFormatter('%b %d')  # e.g., Jan 12
        ax.xaxis.set_major_locator(mondays)
        ax.xaxis.set_minor_locator(alldays)
    else:
        weekFormatter = DateFormatter('%b %d, %Y')
    ax.xaxis.set_major_formatter(weekFormatter)
 
    ax.grid(True)
 
    # Create the candlestick chart
    candlestick_ohlc(ax, list(zip(list(date2num(plotdat.index.tolist())), plotdat["Open"].tolist(), plotdat["High"].tolist(),
                      plotdat["Low"].tolist(), plotdat["Close"].tolist())),
                      colorup = "green", colordown = "red", width = stick * .4)
 
    # Plot other series (such as moving averages) as lines
    if otherseries is not None:
        if not isinstance(otherseries, list):
            otherseries = [otherseries]
        dat.loc[:,otherseries].plot(ax = ax, lw = 1.3, grid = True)
 
    ax.xaxis_date()
    ax.autoscale_view()
    plt.setp(plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
    plt.show()
 
pandas_candlestick_ohlc(df)
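
For comparison, the weekly grouping that the function performs by hand can also be sketched with `DataFrame.resample`. This is an alternative, not what `pandas_candlestick_ohlc` does internally. It assumes a DatetimeIndex and uses synthetic prices.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10, freq="B")  # two full business weeks
opens = pd.Series([100.0 + i for i in range(10)], index=dates)
toy = pd.DataFrame({"Open": opens, "High": opens + 2, "Low": opens - 2, "Close": opens + 1})

# One candlestick per calendar week: first open, highest high, lowest low, last close.
weekly = toy.resample("W").agg({"Open": "first", "High": "max", "Low": "min", "Close": "last"})
print(weekly)
```

Each weekly row is labelled with the Sunday that ends the week, which is the default for the `"W"` frequency.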

<h2>Moving Averages</h2>

Charts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as technical analysis). Let’s now consider how we can find trends in stocks.

A $q$-day moving average is, for a series $x_t$ and a point in time $t$, the average of the past $q$ days: that is, if $MA^q_t$ denotes a moving average process, then:

$MA^q_t = \frac{1}{q} \sum_{i = 0}^{q-1} x_{t - i}$

Moving averages smooth a series and help identify trends. The larger $q$ is, the less responsive a moving average process is to short-term fluctuations in the series $x_t$. The idea is that moving average processes help separate trends from “noise”. Fast moving averages have smaller $q$ and follow the stock more closely, while slow moving averages have larger $q$, so they respond less to the fluctuations of the stock and are more stable.

pandas provides functionality for easily computing moving averages. We demonstrate its use by creating a 20-day moving average (20 trading days, roughly one month) for the selected stock and plotting it alongside the stock.

In [None]:
df["20d"] = np.round(df["Close"].rolling(window = 20, center = False).mean(), 2)
x_date = temp_date + datetime.timedelta(days = 20)
x_date = str(x_date)
pandas_candlestick_ohlc(df.loc[x_date : end_date, : ], otherseries = "20d")
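
To connect the formula for $MA^q_t$ with the `rolling` call, here is a small sketch on toy numbers with $q = 3$. The manual list follows the formula term by term, and `rolling(window=q).mean()` should give the same values.

```python
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = pd.Series(values)
q = 3

# Direct translation of the formula: average of x_t, x_{t-1}, ..., x_{t-q+1}.
manual = [sum(values[t - i] for i in range(q)) / q for t in range(q - 1, len(values))]

# pandas equivalent; the first q-1 entries are NaN because the window is incomplete.
rolled = x.rolling(window=q).mean()

print(manual)  # [2.0, 3.0, 4.0, 5.0]
print(rolled.dropna().tolist())
```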

<h2>Multiple Moving Averages</h2>
Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once.

In [None]:
df["50d"] = np.round(df["Close"].rolling(window = 50, center = False).mean(), 2)
df["200d"] = np.round(df["Close"].rolling(window = 200, center = False).mean(), 2)
x_date = temp_date + datetime.timedelta(days = 200)
x_date = str(x_date)
pandas_candlestick_ohlc(df.loc[x_date : end_date, : ], otherseries = ["20d", "50d", "200d"])
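
The difference between fast and slow moving averages can be seen on an artificial price that jumps once and then stays flat. This is a toy illustration with made-up numbers, not stock data: the 3-day average catches up with the new level several days before the 8-day average does.

```python
import pandas as pd

# A price that jumps from 10 to 20 on day 10 and stays there.
price = pd.Series([10.0] * 10 + [20.0] * 10)

fast = price.rolling(window=3).mean()
slow = price.rolling(window=8).mean()

# First day on which each average has fully caught up with the new level.
fast_day = int((fast >= 20.0 - 1e-9).idxmax())
slow_day = int((slow >= 20.0 - 1e-9).idxmax())
print(fast_day, slow_day)  # 12 17
```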

In [None]:
stock = Stocker(stock_name, df)

In [None]:
del df['OpenInt']
del df['50d']
del df['200d']

<h1>Testing Stationarity</h1>
A time series is stationary when its statistical properties, such as mean and variance, stay constant over time. Testing for stationarity checks whether this holds. A stationary time series is easier to analyze, and its future values are easier to predict.

In [None]:
df['RollStd'] = np.round(df["Close"].rolling(window = 20, center = False).std(), 2)
new_df = df.dropna( axis = 0)
pt = new_df[["Close","20d",'RollStd']].plot(grid = True)
pt.set_xlabel("Year")
pt.set_ylabel("Price")
plt.show()
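
A quick informal check of the idea behind the plot above is to compare the mean of the first half of a series with the mean of the second half. The sketch below uses two artificial series and a hypothetical helper `half_means`. It is only an illustration and no substitute for a formal test.

```python
from statistics import mean

def half_means(series):
    """Mean of the first half and of the second half of a sequence."""
    mid = len(series) // 2
    return mean(series[:mid]), mean(series[mid:])

# Oscillates around zero: the mean does not drift.
flat = [(-1.0) ** t for t in range(40)]
# The same oscillation on top of a steady upward trend: the mean drifts.
trending = [0.5 * t + (-1.0) ** t for t in range(40)]

print(half_means(flat))      # (0.0, 0.0)
print(half_means(trending))  # (4.75, 14.75)
```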

In [None]:
#Performing Dickey-Fuller test:
from statsmodels.tsa.stattools import adfuller

print('Results of Dickey-Fuller Test:')
dftest = adfuller(new_df['Close'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
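
To read the output: the null hypothesis of the Dickey-Fuller test is that the series has a unit root, that is, it is non-stationary. The null is rejected when the test statistic is below (more negative than) the critical value. A minimal sketch of that comparison follows; the helper `rejects_unit_root` is hypothetical and the numbers are illustrative, not results from the stock data.

```python
def rejects_unit_root(test_statistic, critical_values, level="5%"):
    """True if the ADF statistic is below the critical value at the given level."""
    return test_statistic < critical_values[level]

# Illustrative numbers only; the keys match those returned by adfuller.
critical_values = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

print(rejects_unit_root(-3.10, critical_values))        # True: consistent with stationarity at 5%
print(rejects_unit_root(-3.10, critical_values, "1%"))  # False: not at the stricter 1% level
print(rejects_unit_root(-1.20, critical_values))        # False: a unit root cannot be ruled out
```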

<h2>Making data Stationary</h2>
Now, let us try to make the data stationary by applying various methods.
<h3>Estimating & Eliminating Trend</h3>

In [None]:
df_log = np.log(new_df['Close'])
plt.plot(df_log)
plt.grid(True)

<h3>Smoothing</h3>
Taking Moving averages of log values:

In [None]:
moving_avg = df_log.rolling(window = 10, center = False).mean() #Not rounded: 2 decimals would be too coarse for log prices
plt.plot(df_log)
plt.plot(moving_avg, color='red')
plt.grid(True)

In [None]:
moving_avg_diff = df_log - moving_avg
moving_avg_diff.dropna(inplace = True)

def test_stats(ts):
    rolmean = ts.rolling(window=20, center = False).mean()
    rolstd = ts.rolling(window=20, center = False).std()

    #plotting
    orig = plt.plot(ts, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='green', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.grid(True)
    plt.show(block=False)

test_stats(moving_avg_diff)

In [None]:
def adf(ts):
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(ts, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

adf(moving_avg_diff)

<h3>Exponentially Weighted Moving Average</h3>

In [None]:
expwighted_avg = df_log.ewm(halflife = 12).mean()

plt.plot(df_log)
plt.plot(expwighted_avg, color='red')
plt.grid(True)

In [None]:
df_log_emw_avg = df_log - expwighted_avg #difference
test_stats(df_log_emw_avg)

In [None]:
adf(df_log_emw_avg)
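
The `halflife` argument sets how quickly old observations lose weight: after `halflife` steps, an observation counts half as much. The sketch below, on made-up numbers, converts the half-life into the smoothing factor and reproduces the last value of `ewm(...).mean()` by hand. It relies on pandas' default `adjust=True` weighting.

```python
import math
import pandas as pd

halflife = 12
# Smoothing factor implied by the half-life.
alpha = 1 - math.exp(-math.log(2) / halflife)

x = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0])

# With adjust=True, the latest value is a weighted mean in which the observation
# i steps in the past gets weight (1 - alpha)**i.
weights = [(1 - alpha) ** i for i in range(len(x))]
manual_last = sum(w * v for w, v in zip(weights, reversed(x.tolist()))) / sum(weights)

pandas_last = x.ewm(halflife=halflife).mean().iloc[-1]
print(round(alpha, 4), round(manual_last, 6), round(pandas_last, 6))
```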

<h2>Checking Trends and Seasonality</h2>

In [None]:
#Take first difference:
df_log_diff = df_log - df_log.shift()
plt.plot(df_log_diff)

In [None]:
df_log_diff.dropna(inplace=True)
test_stats(df_log_diff)
adf(df_log_diff)
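
Why differencing helps can be seen on an idealised price that grows by exactly 1% per day. Its log rises in a straight line, so the first difference of the log is constant and the trend disappears. The numbers are artificial; a real stock will not behave this cleanly.

```python
import math

# A price that grows by exactly 1% per day has a steadily rising log.
prices = [100.0 * 1.01 ** t for t in range(6)]
log_prices = [math.log(p) for p in prices]

# First difference of the log series: log(price_t) - log(price_{t-1}).
log_diff = [b - a for a, b in zip(log_prices, log_prices[1:])]

# Every difference is (up to rounding) log(1.01), so no trend is left.
print([round(d, 6) for d in log_diff])
```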

<h3>Decomposition</h3>

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df_log, period=52) #Recent statsmodels versions take period; older ones called this argument freq

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(df_log, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

In [None]:
df_log_decompose = residual
df_log_decompose.dropna(inplace=True)
test_stats(df_log_decompose)
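
To show what the three components mean, here is a simplified sketch of a classical additive decomposition (observed = trend + seasonal + residual) on a synthetic series. It is not the statsmodels implementation: it assumes an odd period, a purely linear trend and a seasonal pattern that sums to zero, so the residual comes out as essentially zero.

```python
import numpy as np

period = 5
n = 30
t = np.arange(n)
pattern = np.array([2.0, -1.0, 0.5, -2.0, 0.5])  # sums to zero over one period
observed = 0.3 * t + pattern[t % period]

# Trend: centered moving average over one full period (NaN at both ends).
half = period // 2
trend = np.full(n, np.nan)
for i in range(half, n - half):
    trend[i] = observed[i - half:i + half + 1].mean()

# Seasonal: average detrended value for each position in the cycle.
detrended = observed - trend
seasonal_by_phase = np.array(
    [np.nanmean(detrended[phase::period]) for phase in range(period)]
)
seasonal = seasonal_by_phase[t % period]

# Residual: whatever is left after removing trend and seasonality.
residual = observed - trend - seasonal
print(np.round(seasonal_by_phase, 6))
```

The `NaN` values at the ends of the trend and residual are the same reason `dropna` is needed on the residual before testing it.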


<h1>Starting with Stocker</h1>
<h2>Potential Profit</h2>
Stocker can evaluate the potential profit from buying shares at the start date and holding them until the end date, here for 100 shares. You can also change the dates if you feel like trying to lose money!

In [None]:
stock.buy_and_hold(start_date=start_date, end_date=end_date, nshares=100)
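
The arithmetic behind a buy-and-hold evaluation is simple. The sketch below uses a hypothetical helper and made-up prices; it is not Stocker's implementation, which may differ, for example in which price column it uses.

```python
def buy_and_hold_profit(start_price, end_price, nshares):
    """Profit from buying nshares at start_price and selling them at end_price."""
    return nshares * (end_price - start_price)

# Made-up prices: 100 shares bought at 100.00 and held until the price is 112.50.
profit = buy_and_hold_profit(100.0, 112.5, 100)
print(profit)  # 1250.0
```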

<h2>Changepoints</h2>

Changepoints are one of the most important concepts in time-series analysis. They occur at the maximum value of the second derivative. If that doesn't make much sense, they are the times when the series goes from increasing to decreasing or vice versa, or when the series goes from increasing slowly to increasing rapidly.

We can easily view the changepoints identified by the Prophet model with the following method. This lists the changepoints and displays them on top of the actual data for comparison.

In [None]:
stock.changepoint_date_analysis()
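
The "second derivative" description can be illustrated with a discrete second difference on an artificial series that changes slope once. This only illustrates the idea; it is not how the Prophet model locates changepoints.

```python
import numpy as np

# Rises by 1 per step up to t = 10, then falls by 2 per step.
t = np.arange(21)
series = np.where(t <= 10, t, 10 - 2 * (t - 10)).astype(float)

# Discrete second derivative: series[t+1] - 2*series[t] + series[t-1], for t = 1..19.
second_diff = np.diff(series, n=2)

# Position of the largest change in slope; +1 maps the diff index back to t.
kink = int(np.argmax(np.abs(second_diff))) + 1
print(kink)  # 10
```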

# Predictions

Now that we have analyzed the stock, the next question is where is it going? For that we will have to turn to predictions! 
That is for another notebook, but here is a little idea of what we can do (check out the documentation on GitHub for full details).

In [None]:
model, future = stock.create_prophet_model(days=7)

%%html
<script>
    // AUTORUN ALL CELLS ON NOTEBOOK-LOAD!
    require(
        ['base/js/namespace', 'jquery'], 
        function(jupyter, $) {
            $(jupyter.events).on("kernel_ready.Kernel", function () {
                console.log("Auto-running all cells-below...");
                jupyter.actions.call('jupyter-notebook:run-all-cells-below');
                jupyter.actions.call('jupyter-notebook:save-notebook');
            });
        }
    );
</script>