sumeetgedam/Data_Analysis

Repository to track Data Analysis done on various datasets available online

Data_Analysis

Data Analysis on some famous datasets available online.

Content 📋

  1. Kaggle

  2. Reads

Kaggle

S&P 500 Analysis and Prediction

  • Dataset

  • Notebook

  • Implementation Points

    • Explored the dataset to understand the data
    • The analysis focused on Amazon (AMZN) stock data
    • Visualized the variation in Close, High, Low, and Open prices using matplotlib
    • Forecasting was done using Prophet, Facebook's library for time series forecasting
    • Prophet's plot and components showed the upward trend in both the yearly and monthly AMZN data
    • plotly's graph_objects was used for creating OHLC (Open, High, Low, Close) and candlestick charts
    • Analyzed American Airlines stock to understand seasonality trends
    • Plotted monthly forecasted data to see the seasonal trends in each year
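Prophet fits on a two-column frame named `ds` and `y`; a minimal pandas sketch of reshaping daily close prices into that shape (the dates and values here are illustrative, not the AMZN data from the notebook):

```python
import pandas as pd

# Illustrative daily close prices (synthetic, not the notebook's AMZN data)
prices = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "Close": [100.0, 101.5, 99.8, 102.3, 103.1],
})

# Prophet expects exactly two columns: "ds" (datestamp) and "y" (value);
# the renamed frame below is what would be passed to Prophet().fit(df)
df = prices.rename(columns={"Date": "ds", "Close": "y"})[["ds", "y"]]
print(df.columns.tolist())  # ['ds', 'y']
```

After fitting, Prophet's `make_future_dataframe` and `predict` methods produce the forecast whose plot and components are described above.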

Stock Market Analysis and EDA

  • Dataset

  • Notebook

  • Implementation Points

    • Explored the dataset to understand its structure
    • The NYSE Composite (NYA) was the focus of the analysis among the other available indices
    • Cleaned the data by dropping NA values and filling some with the ffill method
    • Converted the Date attribute to a pandas datetime to track price fluctuations over time
    • Removed outliers and visualized Adjusted Close over time (Date)
    • Used matplotlib and plotly to create pie charts and candlestick charts on the refined data
    • Visualized the correlation between attributes and the 100-day and 200-day simple moving averages with matplotlib
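The 100- and 200-day simple moving averages mentioned above can be sketched with pandas `rolling` (the series here is synthetic, standing in for the NYA Adjusted Close):

```python
import numpy as np
import pandas as pd

# Synthetic price series standing in for the NYA Adjusted Close
close = pd.Series(np.arange(1, 301, dtype=float))

# Simple moving averages over 100- and 200-day windows
sma_100 = close.rolling(window=100).mean()
sma_200 = close.rolling(window=200).mean()

# The first window-1 values are NaN until enough history accumulates
print(sma_100.iloc[99])  # mean of 1..100 = 50.5
```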

Stock Market Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • Downloaded stock market data from the Yahoo Finance website using yfinance
    • Explored and visualized time-series data using pandas, matplotlib, seaborn
    • Measured the correlation between stocks
    • Measured the risk to invest in them and plotted expected risk vs return
    • Predicted closing price using LSTM for Nvidia (NVDA)
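The expected-risk-vs-return step above boils down to the mean and standard deviation of daily percentage returns; a minimal sketch on synthetic prices (the ticker names and values are made up, not real market data):

```python
import pandas as pd

# Synthetic closing prices for two hypothetical tickers
prices = pd.DataFrame({
    "AAA": [100.0, 102.0, 101.0, 105.0, 104.0],
    "BBB": [50.0, 50.5, 49.5, 51.0, 52.0],
})

# Daily percentage returns
returns = prices.pct_change().dropna()

# Expected return = mean daily return; risk = std deviation of returns
expected = returns.mean()
risk = returns.std()
```

Plotting `risk` against `expected` for each ticker gives the risk-vs-return scatter described above.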

Time Series EDA on World War II

  • Dataset

  • Notebook

  • Implementation Points

    • Cleaned the data to remove uncertainty and ease visualization
    • Used scattergeo to plot the bombing paths and weather station locations
    • The weather data was not stationary according to the first Dickey-Fuller test
    • Used popular methods to obtain a constant mean
      • moving average
      • differencing method
    • Evidence that the data is now stationary (at the 99% level)
      • the mean looks constant in the plot
      • the variance also looks constant
      • the test statistic from the Dickey-Fuller test is smaller than the 1% critical value
    • Forecasted the time series
      • used the output from the differencing method
      • fit an ARIMA model after choosing its orders from the ACF and PACF plots
      • visualized the ARIMA model's predictions and computed the mean squared error
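The differencing method listed above can be sketched with pandas; a random walk stands in for the non-stationary weather series (the Dickey-Fuller test itself, available in statsmodels, is omitted here):

```python
import numpy as np
import pandas as pd

# A random walk is non-stationary; differencing recovers a constant mean
rng = np.random.default_rng(0)
walk = pd.Series(np.cumsum(rng.normal(0.0, 1.0, 500)))

# Differencing method: subtract the previous observation
diff = walk.diff().dropna()

# The rolling mean of the differenced series hovers near zero,
# which is what the Dickey-Fuller test would then confirm
rolling_mean = diff.rolling(window=50).mean()
```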

Time Series Basics

  • Dataset

  • Notebook

  • Implementation Points

    • Explored univariate and multivariate time series
    • Visualized datasets to understand the components of a time series
      • Trend
        • Deterministic Trends
        • Stochastic Trends
      • Seasonality
      • Cyclic Patterns
      • Noise
    • Models for Decomposition of a Time Series
      • Additive Model
        • Additive Decomposition
      • Multiplicative Model
        • Multiplicative Decomposition
    • Visualized the Seasonality in datasets
    • Time Series Forecasting Techniques
      • Moving Average
        • Centred Moving Average
        • Trailing Moving Average
    • Handling Missing Values
    • Forecasting Requirements
      • Outliers
      • Resampling
      • Up-sampling
      • Down-sampling
    • Measuring Accuracy
      • Mean Absolute Error
      • Mean Absolute Percentage Error
      • Mean Squared Error
      • Root Mean Square Error
    • ETS ( Error , Trend, Seasonality ) Models
      • SES
        • Simple smoothing with additive errors
      • Holt
        • Holt's linear method with additive errors
          • Double Exponential Smoothing
      • Holt-Winters
        • Holt Winter's linear method with additive errors
          • Multi-step forecast
    • Auto Regressive Models
      • Auto-Correlation function ( ACF ), Partial Auto-Correlation function ( PACF )
      • Stationarity check using Dickey Fuller Test
      • ARIMA Model ( AutoRegressive Integrated Moving Average )
      • Auto ARIMA
        • using AIC ( Akaike Information Criterion ) and BIC (Bayesian Information Criterion ) for model selection
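The centred vs trailing moving averages listed under the forecasting techniques above can be sketched with pandas `rolling`:

```python
import pandas as pd

# Small illustrative series (not from the notebook's dataset)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Trailing moving average: each point averages the current and previous values
trailing = s.rolling(window=3).mean()

# Centred moving average: the window is centred on each point
centred = s.rolling(window=3, center=True).mean()

print(trailing.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(centred.tolist())   # [nan, 2.0, 3.0, 4.0, nan]
```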

Spotify Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • The dataset provides audio information for almost 600k Spotify tracks with 20 features
    • Different visualizations, including a word cloud and bar plots, surfaced the most popular artists, the number of songs per year, the most popular songs, etc.
    • Plotting histograms and boxplots showed the skewness of the features in the dataset
    • Added a new feature marking a song as highly popular if its popularity is greater than 50, then rebalanced the classes using RandomOverSampler
    • Built a pipeline for columns
      • duration : SimpleImputer, FunctionTransformer, RobustScaler
      • categorical: SimpleImputer, OneHotEncoder
      • numerical columns : SimpleImputer, RobustScaler
    • Used this pipeline with LogisticRegression, RandomForestClassifier, and XGBClassifier, and visualized the confusion matrix for each of them as a heatmap
    • The most important features in each model turned out to be the explicit and loudness columns of the dataset
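The per-column pipeline described above can be sketched with scikit-learn's `ColumnTransformer` (the column names and data are illustrative stand-ins for the Spotify features, and the RandomOverSampler step is omitted to keep the sketch to scikit-learn alone):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Tiny synthetic stand-in for the Spotify features
X = pd.DataFrame({
    "duration_ms": [200000.0, None, 180000.0, 240000.0],
    "explicit": ["yes", "no", "no", "yes"],
    "loudness": [-5.0, -7.2, None, -4.1],
})
y = [1, 0, 0, 1]

# Per-column preprocessing: impute then scale numerics,
# impute then one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", RobustScaler())]), ["duration_ms", "loudness"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["explicit"]),
])

# The same preprocessing feeds any of the classifiers mentioned above
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
```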

Airbnb Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • The dataset describes the listing activity and metrics in NYC for 2019
    • Used folium to visualize geographic locations on an interactive map
    • Filled NaN values using KNNImputer
    • Analyzed categorical and numerical variables by plotting countplots of them
    • Visualized the distribution by neighbourhood group, which showed Manhattan to be the priciest neighbourhood_group for Entire home/apt listings
    • Analyzed outliers and replaced them with thresholds calculated from the first and third quartiles
    • Added features and predicted prices using different models
    • The R2 score, mean absolute error, mean squared error, and root mean squared error showed CatBoostRegressor performing best, even after hyperparameter optimization
    • Visualizing the feature importances showed minimum_nights, annual_income, and total_cost as the top three drivers of the model output
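The quartile-based outlier replacement described above is a standard IQR cap; a minimal sketch on a synthetic price column:

```python
import pandas as pd

# Illustrative price column with one extreme value (not the NYC data)
price = pd.Series([90.0, 100.0, 110.0, 105.0, 95.0, 1000.0])

# Thresholds from the first and third quartiles
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace outliers beyond the thresholds with the threshold values (capping)
capped = price.clip(lower=lower, upper=upper)
```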

E-Commerce Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • Explored the data and visualized each column
    • Analyzed negative values to understand the dataset and added new features to it
    • Detected outliers using scatterplots and quantiles
    • Visualized UnitPrice, Quantity, and the engineered Sales feature
    • Cleaned the data for modeling and bucketized UnitPrice, Quantity, and dates
    • Scaled the features and evaluated the data on different models:
      • Linear Regression
      • DecisionTree Regressor
      • Random Forest Regression
    • Calculated the Mean Absolute Error, Mean Squared Error, and R2 Score for each of them.
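The model comparison above can be sketched with scikit-learn on synthetic data (the features here merely stand in for the engineered e-commerce columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the engineered features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each model and collect MAE, MSE, and R2 on the held-out split
scores = {}
for name, model in [("linear", LinearRegression()),
                    ("tree", DecisionTreeRegressor(random_state=0)),
                    ("forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (mean_absolute_error(y_test, pred),
                    mean_squared_error(y_test, pred),
                    r2_score(y_test, pred))

print(scores["linear"][2])  # near 1.0 on this linear synthetic data
```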

NBA Data Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • Explored the dataset and cleaned it for ease of use.
    • Used plotly.express and plotly.graph_objects to visualize players' positions with respect to various attributes
    • Dropped columns based on high correlation to improve the performance of the analysis
    • Modeled the data using:
      • Linear Regression
      • K-Nearest Neighbors (KNN)
      • Decision Tree Regressor
      • RandomForest Regressor
    • Calculated the r2 score, which showed Linear Regression giving the best value of all
    • Visual Comparison of Predicted vs Actual Points

Premier League Data Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • Explored the dataset and visualized missing values
    • Used plotly.express to visualize:
      • Countries most represented
      • Players' appearances
      • Players' ages
    • Used plotly.subplots to visualize player stats by playing position
    • Used plotly.graph_objects to plot graphs for how goals were scored and the mistakes made by players

Diamond Prices Data Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • We started by exploring the dataset, which gave us an idea of the attributes present in it.
    • We then made some changes and visualized the following, along with their comparison against Price:
      • Carat
      • Cut
      • Color
      • Clarity
      • Depth
      • Dimensions
    • Introduced a new feature, Volume, to examine the relationship between a diamond's volume and its price.
    • Split the dataset into train and test sets to evaluate different algorithms, including:
      • Linear Regression
      • Lasso Regression
      • AdaBoost Regression
      • Ridge Regression
      • GradientBoosting Regression
      • RandomForest Regression
      • KNeighbors Regression
    • This gave us the r2 values (coefficient of determination); visualizing them showed RandomForest Regressor achieving the highest r2 value.

Titanic Data Analysis

  • Dataset

  • Notebook

  • Implementation Points

    • Reviewed the train and test data provided
    • Calculated the survival rate of men and women
    • Predicted survival using RandomForestClassifier
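The groupby survival rate and the classifier step can be sketched as follows (the rows are made up, not the Kaggle train.csv):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the Kaggle training data
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 1, 3, 2],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Survival rate of men and women via a groupby mean
rate = train.groupby("Sex")["Survived"].mean()
print(rate["female"], rate["male"])  # 1.0 0.0 on this toy data

# Predict survival from one-hot-encoded features with a random forest
X = pd.get_dummies(train[["Sex", "Pclass"]])
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, train["Survived"])
pred = model.predict(X)
```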

Reads