---

# NASDAQ 100 Stocks Predictor - Introduction and Data Cleaning

**Author:** Renish Kanjiyani <br>
**Notebook:** 1 - Intro & Data Cleaning <br>
**Date:** 05/11/2023 <br>

---

## Table of Contents:

## 1. [Introduction](#1)

### [1.1 Background Information](#1.1)

### [1.2 Business Question](#1.2)

### [1.3 Dataset](#1.3)

### [1.4 Data Dictionary](#1.4)

### [1.5 Goal](#1.5)


## 2. [Data Cleaning](#2)

### [2.1 Importing Packages](#2.1)

### [2.2 Shape of dataframe](#2.2)  

### [2.3 Information on dataframe](#2.3)

### [2.4 Null Values](#2.4)

### [2.5 Duplicates](#2.5)


## 3. [Conclusion](#3)

----------------------------------------

<a id=1></a>
## INTRODUCTION:

<a id=1.1></a>
### Background Information:

Did you know that `61%` of U.S. adults own stocks? The lowest this percentage has hit was in `2013` and `2016`, when only `52%` of adults owned stocks. For over a decade now, more than 50% of the adult population in the U.S. has been invested in the stock market. For some, investing has always been a passion; for others, it has become a hobby and a way to make money. Some invest for the short term, while others prefer to invest for the long term. The main difference is that short-term investors buy and sell far more frequently.

When you think about buying and selling, many factors can lead to a profit or a loss on each trade. The stock market is volatile, and prices can rise or fall in very little time. Because of this volatility, investors carefully pick stocks that they expect to deliver a return, and a lot of research goes into a particular stock before it is bought or sold. Various services help individuals build a portfolio based on the risk level they set. The portfolio is then managed by the company they signed up with, and the individual only gets access to a dashboard showing their returns. But what if we could build a tool that helps individuals grow their portfolio by estimating how a stock is likely to move, so that they can invest at the right time?

<a id=1.2></a>
### Business Question:

Can we leverage machine learning models to help us predict stock price movement?

**Value:**

By modelling positive/negative movement on historical data, we can predict whether a stock is likely to go up or down. Individuals can then use these predictions to choose when to invest and which stocks to invest in.

<a id=1.3></a>
### Dataset: 

To carry out this investigation and build an ML model, we need a dataset. I chose to base my investigation on the NASDAQ-100 stocks. The NASDAQ-100 index consists of the largest non-financial companies listed on the Nasdaq stock exchange. Some of its best-known constituents are `Apple`, `Microsoft`, `Amazon`, `Google`, `Meta`, `Tesla` and many more. It is one of the most widely followed stock indices in the United States, alongside the `S&P 500` and the `Dow Jones Industrial Average`. The original source of the dataset is the `Yahoo Finance API`, and the data was then uploaded to `Kaggle`. The data ranges from 2010 to 2021 and can be downloaded <a href='https://drive.google.com/file/d/1Re1QPoW2kPOm5hAK3lt8hU3-FoNOVl6-/view?usp=sharing'>here</a>.

<a id=1.4></a>
### Data Dictionary:

Before performing further analysis, let's take the time to understand our dataset. It contains a total of 8 columns, which are described below:

| **Column** | **Meaning**                                                                    | **Data Type** |
|------------|--------------------------------------------------------------------------------|---------------|
| Date       | Trading day on which the data was recorded (categorical)                       | object        |
| Open       | The price at which the stock opened on the day (numerical)                     | float         |
| High       | Highest price the stock reached on that day (numerical)                        | float         |
| Low        | Lowest price the stock reached on that day (numerical)                         | float         |
| Close      | The price at which the stock closed on the day (numerical)                     | float         |
| Adj Close  | Closing price adjusted for corporate actions (e.g., dividends) (numerical)     | float         |
| Volume     | Total volume of stocks traded (numerical)                                      | int           |
| Name       | Name of the stock traded (categorical)                                         | object        |
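
Once the data is loaded, the data dictionary above can double as a quick schema check. The snippet below is a minimal sketch rather than part of the original notebook: `toy_df`, `EXPECTED_COLUMNS` and `NUMERIC_COLUMNS` are illustrative names, and the two rows are copied from the `AAPL` rows displayed later in this notebook.

```python
import pandas as pd

# Column names taken from the data dictionary above
EXPECTED_COLUMNS = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume", "Name"]
NUMERIC_COLUMNS = ["Open", "High", "Low", "Close", "Adj Close", "Volume"]

# Toy frame (illustrative): two AAPL rows as shown in the head() output
toy_df = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05"],
    "Open": [7.6225, 7.664286],
    "High": [7.660714, 7.699643],
    "Low": [7.585, 7.616071],
    "Close": [7.643214, 7.656429],
    "Adj Close": [6.562591, 6.573935],
    "Volume": [493729600, 601904800],
    "Name": ["AAPL", "AAPL"],
})

# Columns from the dictionary that are absent from the frame
missing_columns = [col for col in EXPECTED_COLUMNS if col not in toy_df.columns]

# Columns described as numerical that were not parsed as numbers
non_numeric = [col for col in NUMERIC_COLUMNS
               if not pd.api.types.is_numeric_dtype(toy_df[col])]

# By definition, the daily low should never exceed the daily high
low_not_above_high = bool((toy_df["Low"] <= toy_df["High"]).all())

print(missing_columns, non_numeric, low_not_above_high)
```

The same three checks could be run on the full dataframe by swapping `toy_df` for the loaded data.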


<a id=1.5></a>
### Goal:

The end goal is to train a machine learning model that can accurately predict stock price movement, specifically whether the movement is positive or negative. With such a model, we can predict whether a stock's price will go up or down the next day.
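
To make the target concrete, here is one possible way to label next-day movement. It is only a sketch under assumptions: it treats "movement" as the sign of the change in `Close` from one trading day to the next, which may differ from the definition used in the later notebooks. `toy_df`, `Next_Close` and `Movement` are illustrative names, and the rows are copied from the `AAPL` and `ZM` rows displayed later in this notebook.

```python
import pandas as pd

# Toy frame (illustrative): closing prices for two stocks
toy_df = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05", "2010-01-06",
             "2021-09-08", "2021-09-09", "2021-09-10"],
    "Close": [7.643214, 7.656429, 7.534643, 293.600006, 295.859985, 301.5],
    "Name": ["AAPL", "AAPL", "AAPL", "ZM", "ZM", "ZM"],
})

# Make sure rows are in chronological order within each stock
toy_df = toy_df.sort_values(["Name", "Date"]).reset_index(drop=True)

# Next trading day's close, computed separately for each stock
toy_df["Next_Close"] = toy_df.groupby("Name")["Close"].shift(-1)

# Assumed label: 1.0 if the next close is higher, 0.0 otherwise.
# The last row of each stock has no next day, so its label stays NaN.
toy_df["Movement"] = (
    (toy_df["Next_Close"] > toy_df["Close"])
    .astype(float)
    .where(toy_df["Next_Close"].notna())
)

print(toy_df)
```

Shifting within each `Name` group matters here: without the `groupby`, the last `AAPL` row would be compared against the first row of the next stock.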

-----------

<a id=2></a>
## DATA CLEANING:

<a id=2.1></a>
### Importing Packages:

In [2]:
# Loading all the packages required

import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

import warnings
# Show each distinct warning only once to keep the notebook output readable
warnings.filterwarnings(action='once')

In [3]:
# Loading the dataset 

stocks_df = pd.read_csv(
    '/Users/renishkanjiyani/Documents/BrainStation/Final Project/nasdaq_stocks_100.csv',
    sep='\t',
)

In [4]:
# View the first 10 rows of our dataframe

stocks_df.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Name
0,2010-01-04,7.6225,7.660714,7.585,7.643214,6.562591,493729600,AAPL
1,2010-01-05,7.664286,7.699643,7.616071,7.656429,6.573935,601904800,AAPL
2,2010-01-06,7.656429,7.686786,7.526786,7.534643,6.469369,552160000,AAPL
3,2010-01-07,7.5625,7.571429,7.466071,7.520714,6.457407,477131200,AAPL
4,2010-01-08,7.510714,7.571429,7.466429,7.570714,6.500339,447610800,AAPL
5,2010-01-11,7.6,7.607143,7.444643,7.503929,6.442997,462229600,AAPL
6,2010-01-12,7.471071,7.491786,7.372143,7.418571,6.369709,594459600,AAPL
7,2010-01-13,7.423929,7.533214,7.289286,7.523214,6.459555,605892000,AAPL
8,2010-01-14,7.503929,7.516429,7.465,7.479643,6.422143,432894000,AAPL
9,2010-01-15,7.533214,7.557143,7.3525,7.354643,6.314816,594067600,AAPL


In [5]:
# View the last 10 rows of our dataframe

stocks_df.tail(10)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Name
271670,2021-08-27,343.880005,344.779999,337.649994,340.809998,340.809998,3089600,ZM
271671,2021-08-30,341.700012,348.299988,339.649994,347.5,347.5,10094400,ZM
271672,2021-08-31,294.0,295.869995,288.299988,289.5,289.5,34582900,ZM
271673,2021-09-01,292.850006,299.399994,290.049988,290.859985,290.859985,14996000,ZM
271674,2021-09-02,292.799988,296.690002,290.410004,295.089996,295.089996,6645200,ZM
271675,2021-09-03,295.325012,301.804993,292.029999,298.290009,298.290009,6127900,ZM
271676,2021-09-07,298.295013,300.980011,294.799988,299.959991,299.959991,4251900,ZM
271677,2021-09-08,299.549988,299.959991,290.529999,293.600006,293.600006,3934400,ZM
271678,2021-09-09,292.160004,297.570007,291.130005,295.859985,295.859985,3350100,ZM
271679,2021-09-10,296.910004,306.263,296.809998,301.5,301.5,6089600,ZM


<a id=2.2></a>
### Shape of dataframe:

In [6]:
# Let's check the shape of our dataframe

stocks_df.shape

(271680, 8)

In [7]:
# Print the number of rows and columns in our dataframe 

print(f"Our dataframe consists of {stocks_df.shape[0]} rows and {stocks_df.shape[1]} columns")

Our dataframe consists of 271680 rows and 8 columns


<a id=2.3></a>
### Information on dataframe:

In [8]:
# View more information on what our dataframe comprises

stocks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271680 entries, 0 to 271679
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date       271680 non-null  object 
 1   Open       271680 non-null  float64
 2   High       271680 non-null  float64
 3   Low        271680 non-null  float64
 4   Close      271680 non-null  float64
 5   Adj Close  271680 non-null  float64
 6   Volume     271680 non-null  int64  
 7   Name       271680 non-null  object 
dtypes: float64(5), int64(1), object(2)
memory usage: 16.6+ MB


**Observations:**
- The dataframe has `271680` rows, matching the row count we checked earlier, and every column reports `271680` non-null values, so none of the `columns` have missing values.
- We can convert the `Date` column from `object` to a `datetime` format using the `pandas` function `pd.to_datetime`. This will help when we do further EDA and modelling.
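
The `Date` conversion mentioned above could look like the following. This is a minimal sketch on a toy frame (`toy_df` is an illustrative name, with rows copied from the `head()` output), and it assumes the dates follow the `YYYY-MM-DD` pattern seen in that output.

```python
import pandas as pd

# Toy frame (illustrative): dates stored as plain strings
toy_df = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05", "2010-01-06"],
    "Close": [7.643214, 7.656429, 7.534643],
})

# Parse the strings into datetime values; an explicit format avoids guessing
toy_df["Date"] = pd.to_datetime(toy_df["Date"], format="%Y-%m-%d")

# Datetime columns expose the .dt accessor, e.g. for year or weekday
toy_df["Year"] = toy_df["Date"].dt.year

print(toy_df.dtypes)
```

On the full dataframe, the same call would be applied to `stocks_df['Date']`.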

In [9]:
# Statistics table for our dataframe 

stocks_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Open,271680.0,130.1471,259.4633,0.61,32.55,59.81,117.14,3744.0
High,271680.0,131.6786,262.2492,0.66,32.95,60.505,118.47,3773.08
Low,271680.0,128.5645,256.5228,0.61,32.15,59.12,115.82,3696.79
Close,271680.0,130.174,259.455,0.65,32.57,59.85,117.19,3731.41
Adj Close,271680.0,126.9297,260.1569,0.61227,28.00198,55.6,114.7055,3731.41
Volume,271680.0,10526700.0,39248020.0,0.0,1332175.0,2759400.0,6889500.0,1880998000.0


**Observations:** 
- The statistics table shows a very wide spread of values (for example, `Close` ranges from `0.65` to `3731.41`). This is because the table pools every individual stock in the NASDAQ 100 index, and these stocks trade at very different price levels.

- To get more meaningful statistics, we can split the dataframe into individual stocks and look at each one in more depth.
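
One way to look at individual stocks is to group by `Name` before describing. The snippet below is a sketch on a toy frame rather than the notebook's own code; `toy_df`, `per_stock_stats` and `aapl_df` are illustrative names, and the values are copied from the `AAPL` and `ZM` rows shown earlier.

```python
import pandas as pd

# Toy frame (illustrative): closing prices for two stocks
toy_df = pd.DataFrame({
    "Close": [7.643214, 7.656429, 7.534643, 293.600006, 295.859985, 301.5],
    "Name": ["AAPL", "AAPL", "AAPL", "ZM", "ZM", "ZM"],
})

# Summary statistics of Close, computed separately per stock
per_stock_stats = toy_df.groupby("Name")["Close"].describe()
print(per_stock_stats)

# Alternatively, pull a single stock into its own dataframe
aapl_df = toy_df[toy_df["Name"] == "AAPL"]
print(aapl_df)
```

Grouping keeps all stocks in one table for comparison, while filtering is handy when a single stock needs a closer look.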

<a id=2.4></a>
### Null Values:

Let's check for any null values or duplicates in our dataframe:

In [10]:
# Checking for null values 

stocks_df.isna()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Name
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
271675,False,False,False,False,False,False,False,False
271676,False,False,False,False,False,False,False,False
271677,False,False,False,False,False,False,False,False
271678,False,False,False,False,False,False,False,False


In [11]:
# Let's get a sum to see an overview of the null values 

stocks_df.isna().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
Name         0
dtype: int64

**Observations:** 
- The information table showed that every `column` has the same non-null count as the total number of rows, which already suggested there are no null values. As a sanity check, we also counted the null values per column, and the per-column counts confirm that there are none.
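
For a more compact version of this sanity check, the null counts can be reduced to a single number and the affected rows listed directly. This is a sketch on a toy frame (`toy_df` and the other variable names are illustrative), with one missing value injected on purpose so the checks have something to find. The real dataset has none.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative): one NaN deliberately injected into Open
toy_df = pd.DataFrame({
    "Open": [7.6225, np.nan, 7.656429],
    "Close": [7.643214, 7.656429, 7.534643],
})

# Null count per column, as in the cell above
null_counts = toy_df.isna().sum()

# Single total across the whole dataframe
total_nulls = int(null_counts.sum())

# Rows containing at least one null, useful for inspection
rows_with_nulls = toy_df[toy_df.isna().any(axis=1)]

print(null_counts)
print(total_nulls)
print(rows_with_nulls)
```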

<a id=2.5></a>
### Duplicates:

After checking for null values, we can now go ahead and look for duplicate values. 

In [12]:
# Let's check for duplicates 

stocks_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
271675    False
271676    False
271677    False
271678    False
271679    False
Length: 271680, dtype: bool

In [13]:
# Sum the boolean flags to count the duplicated rows

stocks_df.duplicated().sum()

0

**Observations:**
- Our dataframe does not contain any duplicated rows, so we do not need to drop any rows in this case.
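
Note that `duplicated()` without arguments only flags rows that match in every column. A stricter check, sketched below, looks for repeated `Date` and `Name` pairs, on the assumption that each stock should have at most one record per trading day. `toy_df` is an illustrative name, and its third row is a made-up conflicting record added only to demonstrate the check.

```python
import pandas as pd

# Toy frame (illustrative): the third row repeats a (Date, Name) pair
# with a slightly different Close, so it is not a full-row duplicate
toy_df = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05", "2010-01-05"],
    "Close": [7.643214, 7.656429, 7.6565],
    "Name": ["AAPL", "AAPL", "AAPL"],
})

# Full-row duplicates, as checked in the cell above
full_row_dupes = int(toy_df.duplicated().sum())

# Duplicates on the assumed key: one row per stock per day
key_dupes = int(toy_df.duplicated(subset=["Date", "Name"]).sum())

# If key duplicates exist, keeping the first occurrence is one option
deduped_df = toy_df.drop_duplicates(subset=["Date", "Name"], keep="first")

print(full_row_dupes, key_dupes, len(deduped_df))
```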

#### <i>Now that we have a clean dataset, we can go ahead and perform our EDA!</i>

---

<a id=3></a>
## CONCLUSION:

In conclusion: 

- We explored the dataset and confirmed that it has no null values or duplicated rows, so it is ready for further analysis.
- The dataset contains 271,680 rows, which gives us a solid amount of data for our analysis.

In notebook 2, titled 'EDA-1', we will analyze our dataset further and prepare it for the final modelling process.

---