# NASDAQ 100 Stocks Predictor - Exploratory Data Analysis (EDA)

**Author:** Renish Kanjiyani <br>
**Notebook:** 2 - EDA <br>
**Date:** 05/11/2023 <br>
**E-mail:** kanjiyanirenish2@gmail.com

---

# Table of Contents:

---

## Introduction:

---

## EDA:

In [54]:
# Loading all the packages required

import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

import warnings
# Show each distinct warning only once
warnings.filterwarnings('once')


In [55]:
# Loading the dataset 

stocks_df = pd.read_csv('../nasdaq_stocks_100.csv', sep='\t')

In [56]:
# Let's view our dataset 

stocks_df.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Name
0,2010-01-04,7.6225,7.660714,7.585,7.643214,6.562591,493729600,AAPL
1,2010-01-05,7.664286,7.699643,7.616071,7.656429,6.573935,601904800,AAPL
2,2010-01-06,7.656429,7.686786,7.526786,7.534643,6.469369,552160000,AAPL
3,2010-01-07,7.5625,7.571429,7.466071,7.520714,6.457407,477131200,AAPL
4,2010-01-08,7.510714,7.571429,7.466429,7.570714,6.500339,447610800,AAPL
5,2010-01-11,7.6,7.607143,7.444643,7.503929,6.442997,462229600,AAPL
6,2010-01-12,7.471071,7.491786,7.372143,7.418571,6.369709,594459600,AAPL
7,2010-01-13,7.423929,7.533214,7.289286,7.523214,6.459555,605892000,AAPL
8,2010-01-14,7.503929,7.516429,7.465,7.479643,6.422143,432894000,AAPL
9,2010-01-15,7.533214,7.557143,7.3525,7.354643,6.314816,594067600,AAPL


In [57]:
# Let's check the shape of our dataframe 

stocks_df.shape

(271680, 8)

In [58]:
print(f"Our dataframe has {stocks_df.shape[0]} rows and {stocks_df.shape[1]} columns")

Our dataframe has 271680 rows and 8 columns


In [59]:
# Let's view more information on our dataframe 

stocks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271680 entries, 0 to 271679
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date       271680 non-null  object 
 1   Open       271680 non-null  float64
 2   High       271680 non-null  float64
 3   Low        271680 non-null  float64
 4   Close      271680 non-null  float64
 5   Adj Close  271680 non-null  float64
 6   Volume     271680 non-null  int64  
 7   Name       271680 non-null  object 
dtypes: float64(5), int64(1), object(2)
memory usage: 16.6+ MB


**Observations:** 
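- The clean dataset loaded correctly: all 271,680 rows are non-null across the 8 columns, so we can perform our EDA on it.
- The `Date` column is stored as `object`, so it needs to be converted before any time-based analysis.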
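
Beyond `info()`, a quick way to confirm the data is clean is to count missing values and repeated (ticker, date) pairs. The snippet below is a minimal sketch on a toy frame; `toy_df` and its values are made up for illustration, and the same two calls should work on `stocks_df`.

```python
import pandas as pd

# Toy stand-in for stocks_df (values are made up for illustration only)
toy_df = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05", "2010-01-04", "2010-01-05"],
    "Close": [7.64, 7.66, 37.09, 37.70],
    "Volume": [493729600, 601904800, 4710200, 4186000],
    "Name": ["AAPL", "AAPL", "ADBE", "ADBE"],
})

# Count missing values per column; all zeros means there is nothing to impute
missing_per_column = toy_df.isna().sum()

# Count rows that repeat an earlier (Name, Date) pair;
# each ticker is expected to have one row per trading day
duplicate_rows = toy_df.duplicated(subset=["Name", "Date"]).sum()

print(missing_per_column)
print(f"Duplicate (Name, Date) rows: {duplicate_rows}")
```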

### Converting the `Date` column to a `datetime` format

In [60]:
# Let's convert the column now 

stocks_df['Date'] = pd.to_datetime(stocks_df['Date'])

In [61]:
# Sanity Check 
stocks_df.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume                int64
Name                 object
dtype: object

- We successfully converted the `Date` column to the `datetime64[ns]` type.
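
If the raw file could contain malformed dates, one option is to parse with `errors="coerce"` and count the values that failed to parse. This is a sketch on a toy frame (`toy_df` and the bad value are hypothetical), not a step the notebook itself performs.

```python
import pandas as pd

# Toy stand-in with one deliberately malformed date (hypothetical data)
toy_df = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05", "not a date"],
    "Name": ["AAPL", "AAPL", "AAPL"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(toy_df["Date"], errors="coerce")

# Number of values that could not be parsed
n_bad = parsed.isna().sum()

# The covered date range is a useful extra sanity check
print(f"Unparseable dates: {n_bad}")
print(f"Range: {parsed.min()} to {parsed.max()}")
```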

### Exploring the `Name` column:

In [62]:
# Let's take a look at the unique values in our Name column

stocks_df['Name'].unique()

array(['AAPL', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AEP', 'ALGN', 'AMAT', 'AMD',
       'AMGN', 'AMZN', 'ANSS', 'ASML', 'ATVI', 'AVGO', 'BIDU', 'BIIB',
       'BKNG', 'CDNS', 'CDW', 'CERN', 'CHKP', 'CHTR', 'CMCSA', 'COST',
       'CPRT', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CTSH', 'DLTR', 'DOCU',
       'DXCM', 'EA', 'EBAY', 'EXC', 'FAST', 'FB', 'FISV', 'FOX', 'FOXA',
       'GILD', 'GOOG', 'GOOGL', 'HON', 'IDXX', 'ILMN', 'INCY', 'INTC',
       'INTU', 'ISRG', 'JD', 'KDP', 'KHC', 'KLAC', 'LRCX', 'LULU', 'MAR',
       'MCHP', 'MDLZ', 'MELI', 'MNST', 'MRNA', 'MRVL', 'MSFT', 'MTCH',
       'MU', 'NFLX', 'NTES', 'NVDA', 'NXPI', 'OKTA', 'ORLY', 'PAYX',
       'PCAR', 'PDD', 'PEP', 'PTON', 'PYPL', 'QCOM', 'REGN', 'ROST',
       'SBUX', 'SGEN', 'SIRI', 'SNPS', 'SPLK', 'SWKS', 'TCOM', 'TEAM',
       'TMUS', 'TSLA', 'TXN', 'VRSK', 'VRSN', 'VRTX', 'WBA', 'WDAY',
       'XEL', 'XLNX', 'ZM'], dtype=object)

In [66]:
# Number of rows per ticker
stocks_df['Name'].value_counts()


AAPL    2943
NVDA    2943
NFLX    2943
MU      2943
MTCH    2943
        ... 
FOXA     632
FOX      631
ZM       605
CRWD     568
PTON     494
Name: Name, Length: 102, dtype: int64
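
To see where the differing row counts come from, we could also look at the first and last date available for each ticker. The sketch below uses a toy frame with hypothetical tickers (`AAA`, `BBB`, `CCC`); on the real data the same `groupby` would be applied to `stocks_df`.

```python
import pandas as pd

# Toy stand-in: three hypothetical tickers with different history lengths
toy_df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2010-01-04", "2010-01-05", "2010-01-06",
        "2010-01-05", "2010-01-06",
        "2010-01-06",
    ]),
    "Name": ["AAA", "AAA", "AAA", "BBB", "BBB", "CCC"],
})

# Rows, first date and last date per ticker, longest history first
coverage = (
    toy_df.groupby("Name")["Date"]
    .agg(rows="count", first_date="min", last_date="max")
    .sort_values("rows", ascending=False)
)

# Number of distinct tickers (a called method, unlike the bare attribute `.nunique`)
n_tickers = toy_df["Name"].nunique()

print(coverage)
print(f"Distinct tickers: {n_tickers}")
```

A later `first_date` for a ticker would suggest that its shorter history comes from a later start in the data rather than from gaps.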

**Observations:** 
- We can see that some stocks, such as `AAPL` (Apple), have `2943` rows, whereas others, such as `FOX`, have only `631`. The tickers therefore do not all cover the same number of trading days.

- Rather than analyzing all 102 tickers, we will look at the top 10 constituents of the NASDAQ 100 Index.

<b>The top 10 constituents of the NASDAQ 100 Index are: </b> <br>

`AAPL` (Apple), `MSFT` (Microsoft), `GOOG` (Google Class C), `GOOGL` (Google Class A), `AMZN` (Amazon), `TSLA` (Tesla), `FB` (Facebook/Meta), `NVDA` (Nvidia), `PEP` (Pepsi), `COST` (Costco).


<b>Breakdown of the top 10 constituents:</b>

Now that we have identified our top 10 constituents, analyzing every stock individually is still impractical. Instead, we can group related stocks and perform further analysis on the group.

We can group the following individual stocks: `AAPL`, `MSFT`, `GOOG`, `GOOGL`, `AMZN` and `FB`. We will call this group `tech_giants`.

The assumption behind this is that changes in the stock prices of the `tech_giants` companies correlate with the other stocks in the index, because these companies carry a large share of the index's weight. Under that assumption, analyzing this group should give us a reasonable idea of how the other stocks in the `NASDAQ-100` index behave. This is a working assumption that has not been tested yet.
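
One way this assumption could be checked is to correlate each remaining stock's daily returns with the average daily return of the group. The sketch below runs on synthetic prices, so the tickers (`GIANT_A`, `OTHER_A`, and so on), the shared market factor and the resulting numbers are all illustrative assumptions and say nothing about the real index.

```python
import numpy as np
import pandas as pd

# Synthetic long-format prices in the same layout as stocks_df (hypothetical tickers)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-04", periods=250)
market = rng.normal(0, 0.01, size=len(dates))  # shared factor, an assumption of the toy data

frames = []
for ticker in ["GIANT_A", "GIANT_B", "OTHER_A", "OTHER_B"]:
    returns = market + rng.normal(0, 0.01, size=len(dates))
    prices = 100 * np.cumprod(1 + returns)
    frames.append(pd.DataFrame({"Date": dates, "Name": ticker, "Adj Close": prices}))
toy_df = pd.concat(frames, ignore_index=True)

# One column per ticker, then simple daily returns
wide = toy_df.pivot(index="Date", columns="Name", values="Adj Close")
daily_returns = wide.pct_change().dropna()

# Equal-weighted average return of the group (the real index is not equal-weighted)
giants = ["GIANT_A", "GIANT_B"]
giants_return = daily_returns[giants].mean(axis=1)

# Pearson correlation of every other ticker with the group's average return
others = daily_returns.drop(columns=giants)
correlation_with_giants = others.corrwith(giants_return)

print(correlation_with_giants.round(2))
```

On the real data, correlations close to 1 for most tickers would support the assumption, while low or negative values would argue for analyzing those stocks separately.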

### Grouping Stocks

Let's create a new dataframe that will consist of the following stocks: 

`AAPL`, `MSFT`, `GOOG`, `GOOGL`, `AMZN` and `FB`. We will name this group `tech_giants`.

In [71]:
# Creating the dataframe

tech_giants = stocks_df[stocks_df['Name'].isin(['AAPL', 'MSFT', 'GOOG', 'GOOGL', 'AMZN', 'FB'])]


In [74]:
# Sanity Check 

tech_giants.sample(10)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Name
31769,2019-04-22,1855.400024,1888.420044,1845.640015,1887.310059,1887.310059,3373800,AMZN
118324,2019-05-03,1173.650024,1186.800049,1169.0,1185.400024,1185.400024,1980700,GOOG
116562,2012-05-01,300.767639,304.658081,298.974365,301.086456,301.086456,4019610,GOOG
32047,2020-05-28,2384.330078,2436.969971,2378.22998,2401.100098,2401.100098,3190200,AMZN
30008,2012-04-19,192.929993,194.550003,189.75,191.100006,191.100006,4002400,AMZN
178807,2021-03-18,232.559998,234.190002,230.330002,230.720001,229.748642,34833000,MSFT
29663,2010-12-06,175.520004,178.429993,174.600006,178.050003,178.050003,5654200,AMZN
121171,2018-12-13,1075.670044,1088.420044,1064.98999,1073.540039,1073.540039,1249300,GOOGL
106638,2013-01-02,27.440001,28.18,27.42,28.0,28.0,69846400,FB
177559,2016-04-04,55.43,55.66,55.0,55.43,50.613754,18928800,MSFT


In [75]:
# Checking the shape of the new dataframe 

tech_giants.shape

(17059, 8)

In [77]:
# Check the unique names in the Name column

tech_giants['Name'].unique()

array(['AAPL', 'AMZN', 'FB', 'GOOG', 'GOOGL', 'MSFT'], dtype=object)

**Observations:**

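- We successfully created a new dataframe that contains the six `tech_giants` stocks (`AAPL`, `AMZN`, `FB`, `GOOG`, `GOOGL`, `MSFT`) taken from the top 10 constituents of the NASDAQ 100 index. It has 17,059 rows and 8 columns.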

- The new dataframe `tech_giants` will now be used for further explorations and modelling. 
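
Since `tech_giants` will be modified in later steps, it may be worth taking an explicit copy of the filtered rows and confirming that every expected ticker is present. This is a minimal sketch on a toy frame; `toy_df`, `toy_giants`, `giant_tickers` and the prices are hypothetical.

```python
import pandas as pd

# Toy stand-in for stocks_df (prices are made up for illustration only)
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2012-05-17", "2012-05-18", "2012-05-18", "2012-05-18"]),
    "Close": [18.9, 18.9, 38.2, 31.5],
    "Name": ["AAPL", "AAPL", "FB", "ADBE"],
})

giant_tickers = ["AAPL", "MSFT", "GOOG", "GOOGL", "AMZN", "FB"]

# .copy() makes the subset independent of the original frame, which avoids
# SettingWithCopyWarning when columns are added to it later
toy_giants = toy_df[toy_df["Name"].isin(giant_tickers)].copy()

# Rows per ticker in the subset
rows_per_ticker = toy_giants["Name"].value_counts()

# Tickers we asked for that are absent from the data
missing_tickers = sorted(set(giant_tickers) - set(toy_giants["Name"]))

print(rows_per_ticker)
print(f"Missing tickers: {missing_tickers}")
```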

---

## Conclusion: