Data Analysis on some famous datasets available online.
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset to understand the data
- The analysis focused on Amazon (AMZN) stock data
- Visualized the variation in Close, High, Low, and Open using matplotlib
- Forecasting was done using Prophet, Facebook's library for time series forecasting (a minimal sketch follows this list)
- Prophet's plot and components showed the upward trend in both the yearly and monthly AMZN stock data
- plotly's graph_objects was used for creating OHLC (Open, High, Low, Close) and Candlestick charts
- Analyzed American Airlines stock to understand seasonality trends
- Plotted monthly forecasted data to see the seasonal trends in each year
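A minimal sketch of the Prophet forecasting step, assuming a CSV of AMZN daily prices with Date and Close columns (the file name is hypothetical):

```python
import pandas as pd
from prophet import Prophet  # on older installs: from fbprophet import Prophet

# Hypothetical file of AMZN daily prices with "Date" and "Close" columns.
df = pd.read_csv("AMZN.csv")

# Prophet expects exactly two columns: ds (datestamp) and y (value).
data = pd.DataFrame({"ds": pd.to_datetime(df["Date"]), "y": df["Close"]})

model = Prophet()
model.fit(data)

# Forecast a year ahead, then plot the fit and its trend/seasonality parts.
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
model.plot(forecast)
model.plot_components(forecast)
```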
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset to understand its structure
- The NYSE Composite (NYA) was the focus of analysis among the other available indices
- Cleaned the data by dropping NA values and filling others with the ffill method
- Converted the Date attribute to pandas datetime to track price fluctuations over time
- Removed outliers and visualized the Adjusted Close over time (Date)
- Used matplotlib and plotly to create pie charts and a Candlestick chart on the refined data
- Visualized the correlation between attributes and the 100-day and 200-day simple moving averages with matplotlib (sketched below)
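A minimal sketch of the moving-average step, assuming a date-indexed frame with an Adj Close column (the file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: a CSV with a "Date" column and an "Adj Close" column.
nya = pd.read_csv("nya.csv", parse_dates=["Date"]).set_index("Date")

# pandas' rolling window gives the simple moving averages directly.
nya["SMA100"] = nya["Adj Close"].rolling(window=100).mean()
nya["SMA200"] = nya["Adj Close"].rolling(window=200).mean()

nya[["Adj Close", "SMA100", "SMA200"]].plot(figsize=(12, 6))
plt.title("NYA Adjusted Close with 100/200-day SMAs")
plt.show()
```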
-
Dataset
- Yahoo Finance via Python's yfinance
-
Notebook
-
Implementation Points
- Downloaded stock market data from Yahoo Finance using yfinance
- Explored and visualized the time-series data using pandas, matplotlib, and seaborn
- Measured the correlation between stocks
- Measured the risk of investing in them and plotted expected risk vs. return (see the sketch below)
- Predicted the closing price for Nvidia (NVDA) using an LSTM
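A hedged sketch of the download and risk-vs-return step; the tickers besides NVDA are illustrative assumptions:

```python
import matplotlib.pyplot as plt
import yfinance as yf

# Tickers other than NVDA are illustrative assumptions.
tickers = ["NVDA", "AAPL", "MSFT", "AMZN"]
closes = yf.download(tickers, period="1y")["Close"]

# Daily percentage returns: the mean approximates expected return and
# the standard deviation approximates risk.
returns = closes.pct_change().dropna()

plt.scatter(returns.std(), returns.mean())
for ticker in tickers:
    plt.annotate(ticker, (returns.std()[ticker], returns.mean()[ticker]))
plt.xlabel("Risk (std of daily returns)")
plt.ylabel("Expected return (mean of daily returns)")
plt.show()
```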
-
Dataset
-
Notebook
-
Implementation Points
- Cleaned the data to remove ambiguities and ease visualization
- Used Scattergeo to visualize the bombing paths and weather station locations
- The weather data was not stationary according to an initial Dickey-Fuller test
- Used popular methods to obtain a constant mean
- moving average
- differencing method
- Evidence for believing the data is now stationary at the 99% level
- looking at the plot, the mean appears constant
- the variance also appears constant
- the test statistic from the Dickey-Fuller test is smaller than the 1% critical value
- Forecasted the time series (sketched after this list)
- used the output of the differencing method
- used ARIMA for prediction after finding the order parameters from the ACF and PACF plots
- visualized the ARIMA model's predictions and computed the mean squared error
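A sketch of the stationarity check and ARIMA fit with statsmodels; the file and column names and the ARIMA order are assumptions, since the notebook's actual order came from the ACF/PACF plots:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Assumed layout: daily weather readings with "Date" and "MeanTemp" columns.
series = pd.read_csv("weather.csv", parse_dates=["Date"],
                     index_col="Date")["MeanTemp"]

# First-order differencing to stabilise the mean.
diffed = series.diff().dropna()

# Dickey-Fuller: treat the series as stationary at the 99% level when the
# test statistic is below the 1% critical value.
result = adfuller(diffed)
print(f"test statistic: {result[0]:.3f}")
print(f"1% critical value: {result[4]['1%']:.3f}")

# Fit ARIMA with an illustrative (p, d, q) order and compute in-sample MSE.
model = ARIMA(series, order=(1, 1, 1)).fit()
pred = model.predict(start=1, end=len(series) - 1)
mse = ((series.iloc[1:] - pred) ** 2).mean()
print(f"in-sample MSE: {mse:.3f}")
```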
-
Dataset
-
Notebook
-
Implementation Points
- Explored univariate and multivariate time series
- Visualized datasets to understand the components of a time series
- Trend
- Deterministic Trends
- Stochastic Trends
- Seasonality
- Cyclic Patterns
- Noise
- Models for Decomposition of TimeSeries
- Additive Model
- Additive Decomposition
- Multiplicative Model
- Multiplicative Decomposition
- Visualized the Seasonality in datasets
- TimeSeries Forecasting Techniques
- Moving Average
- Centred Moving Average
- Trailing Moving Average
- Handling Missing Values
- Forecasting Requirements
- Outliers
- Resampling
- Up-sampling
- Down-sampling
- Measuring Accuracy
- Mean Absolute Error
- Mean Absolute Percentage Error
- Mean Squared Error
- Root Mean Square Error
- ETS ( Error , Trend, Seasonality ) Models
- SES
- Simple smoothing with additive errors
- Holt
- Holt's linear method with additive errors
- Double Exponential Smoothing
- Holt's linear method with additive errors
- Holt-Winters
- Holt Winter's linear method with additive errors
- Multi-step forecast
- Holt Winter's linear method with additive errors
- Auto Regressive Models
- Auto-Correlation function ( ACF ), Partial Auto-Correlation function ( PACF )
- Stationarity check using Dickey Fuller Test
- ARIMA Model ( AutoRegressive Integrated Moving Average )
- Auto ARIMA
- using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) for model selection
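To tie the decomposition and ETS points above together, here is a minimal statsmodels sketch on a monthly series; the AirPassengers-style layout is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed layout: a monthly series such as AirPassengers, with "Month"
# and "Passengers" columns.
series = pd.read_csv("airpassengers.csv", parse_dates=["Month"],
                     index_col="Month")["Passengers"]

# Additive decomposition: observed = trend + seasonality + residual.
seasonal_decompose(series, model="additive", period=12).plot()
plt.show()

# Holt-Winters with additive trend and seasonality; multi-step forecast.
hw = ExponentialSmoothing(series, trend="add", seasonal="add",
                          seasonal_periods=12).fit()
print(hw.forecast(steps=24).head())
```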
-
Dataset
-
Notebook
-
Implementation Points
- The dataset provides audio information for almost 600k Spotify tracks with 20 features
- Different visualizations, including a WordCloud and bar plots, surfaced the most popular artists, the number of songs per year, the most popular songs, etc.
- Histograms and boxplots showed the skewness of the features in the dataset
- Added a new feature marking a song as highly popular when its popularity is greater than 50, then rebalanced the classes using RandomOverSampler
- Built a per-column preprocessing pipeline (sketched after this list)
- duration: SimpleImputer, FunctionTransformer, RobustScaler
- categorical columns: SimpleImputer, OneHotEncoder
- numerical columns: SimpleImputer, RobustScaler
- Used this pipeline with LogisticRegression, RandomForestClassifier, and XGBClassifier, visualizing the confusion matrix for each of them as a heatmap
- The most important features in each model turned out to be the explicit and loudness columns of the dataset.
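A sketch of that per-column pipeline using scikit-learn's ColumnTransformer; the exact column lists are assumptions based on the dataset's usual schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler

preprocess = ColumnTransformer([
    # duration: impute, convert ms to minutes, then scale robustly
    ("duration", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("to_minutes", FunctionTransformer(lambda x: x / 60000.0)),
        ("scale", RobustScaler()),
    ]), ["duration_ms"]),
    # categorical: impute the most frequent value, then one-hot encode
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["key", "mode", "explicit"]),
    # remaining numerical columns: impute, then scale robustly
    ("numerical", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), ["danceability", "energy", "loudness", "tempo"]),
])

# The same preprocessing feeds each classifier; LogisticRegression shown here.
clf = Pipeline([("prep", preprocess),
                ("model", LogisticRegression(max_iter=1000))])
```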
-
Dataset
-
Notebook
-
Implementation Points
- The dataset describes the listing activity and metrics in NYC for 2019
- Used folium to visualize geographic locations on an interactive map
- Filled NaN values using KNNImputer
- Analyzed categorical and numerical variables by plotting countplots of them
- Visualized the distribution by neighbourhood group, which showed Manhattan to be the priciest neighbourhood_group for Entire home/apt listings
- Analyzed outliers and replaced them with thresholds calculated from the first and third quartiles (sketched after this list)
- Added features and predicted on the data using different models
- The R² score, mean absolute error, mean squared error, and root mean squared error showed CatBoostRegressor performing best, even after hyperparameter optimization
- Visualizing the feature importances showed minimum_nights, annual_income, and total_cost as the top three drivers of the model's output
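A minimal sketch of the quartile-based outlier capping; the file name matches the usual Kaggle export, and targeting the price column is an assumption:

```python
import pandas as pd

def cap_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Clip values outside the 1.5 * IQR whiskers to the whisker values."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[column] = df[column].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return df

listings = pd.read_csv("AB_NYC_2019.csv")
listings = cap_outliers(listings, "price")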
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset and visualized each column
- Analyzed negative values to understand the dataset and added new features
- Detected outliers using scatterplots and quantiles
- Visualized UnitPrice, Quantity, and the derived Sales feature
- Cleaned the data for modeling and bucketized UnitPrice, Quantity, and the dates (sketched after this list)
- Scaled the features and tested the data on different models:
- Linear Regression
- DecisionTree Regressor
- RandomForest Regressor
- Calculated the mean absolute error, mean squared error, and R² score for each of them.
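A sketch of the bucketizing step with pandas; the bin edges and file name are illustrative assumptions, not the notebook's exact values:

```python
import pandas as pd

# Assumed layout: the Online Retail schema with InvoiceDate, UnitPrice,
# and Quantity columns.
retail = pd.read_csv("online_retail.csv", parse_dates=["InvoiceDate"])
retail["Sales"] = retail["UnitPrice"] * retail["Quantity"]

# Fixed-width buckets for UnitPrice and Quantity.
retail["PriceBucket"] = pd.cut(retail["UnitPrice"],
                               bins=[0, 1, 5, 10, 50, float("inf")],
                               labels=["<1", "1-5", "5-10", "10-50", "50+"])
retail["QuantityBucket"] = pd.cut(retail["Quantity"],
                                  bins=[0, 1, 5, 20, float("inf")],
                                  labels=["single", "few", "many", "bulk"])

# Month buckets for the invoice dates.
retail["Month"] = retail["InvoiceDate"].dt.to_period("M")
```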
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset and cleaned it for ease of use.
- Used plotly.express and plotly.graph_objects to visualize player positions with respect to various attributes
- Dropped highly correlated columns to improve the analysis (sketched after this list)
- Modeled the data using:
- Linear Regression
- K-Nearest Neighbors (KNN)
- Decision Tree Regressor
- RandomForest Regressor
- Calculated the R² score, which showed Linear Regression giving the best value of all
- Visually compared predicted vs. actual points
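A minimal sketch of the correlation-based column drop; the file name and the 0.9 threshold are assumptions:

```python
import numpy as np
import pandas as pd

players = pd.read_csv("players.csv")  # hypothetical file name
corr = players.select_dtypes("number").corr().abs()

# Keep only the upper triangle so each pair is considered once, then drop
# one column from every pair whose correlation exceeds the threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
players = players.drop(columns=to_drop)
```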
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset and visualized missing values
- Used plotly.express to visualize (see the sketch after this list):
- the countries most represented
- player appearances
- players' ages
- Used plotly.subplots to visualize player stats by playing position
- Used plotly.graph_objects to plot how goals were scored and the mistakes players made
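A minimal plotly.express sketch for the countries-most-represented chart; the file and column names are assumptions:

```python
import pandas as pd
import plotly.express as px

players = pd.read_csv("players.csv")  # hypothetical file and column names
counts = players["Country"].value_counts().head(20).reset_index()
counts.columns = ["Country", "Players"]

fig = px.bar(counts, x="Country", y="Players",
             title="Countries most represented")
fig.show()
```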
-
Dataset
-
Notebook
-
Implementation Points
- We started by exploring the dataset, which gave us an idea of the attributes present in it.
- We then made some changes and visualized the following:
- Carat
- Cut
- Color
- Clarity
- Depth
- Dimensions
- along with their comparison with Price.
- Introduced a new feature, Volume, to see the relationship between a diamond's volume and its price.
- Divided the dataset into train and test sets to evaluate different algorithms (evaluation loop sketched after this list), including:
- Linear Regression
- Lasso Regression
- AdaBoost Regression
- Ridge Regression
- GradientBoosting Regression
- RandomForest Regression
- KNeighbors Regression
- which gave us the R² values (coefficient of determination); visualizing them showed RandomForest Regressor achieving the highest R².
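A sketch of the train/test evaluation loop over a few of these regressors, simplified to numeric columns only; the standard diamonds.csv layout is assumed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

diamonds = pd.read_csv("diamonds.csv")
X = diamonds.select_dtypes("number").drop(columns=["price"])
y = diamonds["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit each model on the train split and report R² on the test split.
models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))
```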
-
Dataset
-
Notebook
-
Implementation Points
- Reviewed the provided train and test data
- Calculated the survival rate of men and women
- Predicted survival using RandomForestClassifier (sketched below)
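A minimal sketch in the style of the Kaggle Titanic starter, assuming the standard train.csv/test.csv files:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Survival rate of women vs. men.
print(train.groupby("Sex")["Survived"].mean())

# A few basic features, one-hot encoded where categorical.
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, train["Survived"])
predictions = model.predict(X_test)
```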