Data Analysis on some famous datasets available online.
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset to understand the data
- The analysis focused on Amazon (AMZN) stock data
- Visualized the variation in Close, High, Low, and Open using matplotlib
- Forecasting was done using Prophet, Facebook's library for time series forecasting (a minimal sketch follows this list)
- Prophet's plot and components showed the upward trend in both the yearly and monthly AMZN stock data
- plotly's graph_objects was used for creating OHLC (Open, High, Low, Close) and Candlestick charts
- Analyzed American Airlines stock to understand seasonality trends
- Plotted monthly forecasted data to see the seasonal trends in each year
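A minimal sketch of the Prophet forecasting step, assuming a CSV of AMZN daily prices with Date and Close columns (the file name is hypothetical):

```python
import pandas as pd
from prophet import Prophet  # on older installs: from fbprophet import Prophet

# Hypothetical file of AMZN daily prices with "Date" and "Close" columns.
df = pd.read_csv("AMZN.csv")

# Prophet expects exactly two columns: ds (datestamp) and y (value).
data = pd.DataFrame({"ds": pd.to_datetime(df["Date"]), "y": df["Close"]})

model = Prophet()
model.fit(data)

# Forecast a year ahead, then plot the fit and its trend/seasonality parts.
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
model.plot(forecast)
model.plot_components(forecast)
```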
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset to understand its structure
- The NYSE Composite (NYA) was the focus of analysis among the other available indices
- Cleaned the data by dropping NA values and filling others with the ffill method
- Converted the Date attribute to pandas datetime to track price fluctuations over time
- Removed outliers and visualized the Adjusted Close over time (Date)
- Used matplotlib and plotly to create pie charts and a Candlestick chart on the refined data
- Visualized the correlation between attributes and the 100-day and 200-day simple moving averages with matplotlib (sketched below)
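A minimal sketch of the moving-average step, assuming a date-indexed frame with an Adj Close column (the file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: a CSV with a "Date" column and an "Adj Close" column.
nya = pd.read_csv("nya.csv", parse_dates=["Date"]).set_index("Date")

# pandas' rolling window gives the simple moving averages directly.
nya["SMA100"] = nya["Adj Close"].rolling(window=100).mean()
nya["SMA200"] = nya["Adj Close"].rolling(window=200).mean()

nya[["Adj Close", "SMA100", "SMA200"]].plot(figsize=(12, 6))
plt.title("NYA Adjusted Close with 100/200-day SMAs")
plt.show()
```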
-
Dataset
- Yahoo Finance via Python's yfinance
-
Notebook
-
Implementation Points
- Downloaded stock market data from Yahoo Finance using yfinance
- Explored and visualized the time-series data using pandas, matplotlib, and seaborn
- Measured the correlation between stocks
- Measured the risk of investing in them and plotted expected risk vs. return (see the sketch below)
- Predicted the closing price for Nvidia (NVDA) using an LSTM
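A hedged sketch of the download and risk-vs-return step; the tickers besides NVDA are illustrative assumptions:

```python
import matplotlib.pyplot as plt
import yfinance as yf

# Tickers other than NVDA are illustrative assumptions.
tickers = ["NVDA", "AAPL", "MSFT", "AMZN"]
closes = yf.download(tickers, period="1y")["Close"]

# Daily percentage returns: the mean approximates expected return and
# the standard deviation approximates risk.
returns = closes.pct_change().dropna()

plt.scatter(returns.std(), returns.mean())
for ticker in tickers:
    plt.annotate(ticker, (returns.std()[ticker], returns.mean()[ticker]))
plt.xlabel("Risk (std of daily returns)")
plt.ylabel("Expected return (mean of daily returns)")
plt.show()
```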
-
Dataset
-
Notebook
-
Implementation Points
- Cleaned the data to remove ambiguities and ease visualization
- Used Scattergeo to visualize the bombing paths and weather station locations
- The weather data was not stationary according to an initial Dickey-Fuller test
- Used popular methods to obtain a constant mean
- moving average
- differencing method
- Evidence for believing the data is now stationary at the 99% level
- looking at the plot, the mean appears constant
- the variance also appears constant
- the test statistic from the Dickey-Fuller test is smaller than the 1% critical value
- Forecasted the time series (sketched after this list)
- used the output of the differencing method
- used ARIMA for prediction after finding the order parameters from the ACF and PACF plots
- visualized the ARIMA model's predictions and computed the mean squared error
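A sketch of the stationarity check and ARIMA fit with statsmodels; the file and column names and the ARIMA order are assumptions, since the notebook's actual order came from the ACF/PACF plots:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Assumed layout: daily weather readings with "Date" and "MeanTemp" columns.
series = pd.read_csv("weather.csv", parse_dates=["Date"],
                     index_col="Date")["MeanTemp"]

# First-order differencing to stabilise the mean.
diffed = series.diff().dropna()

# Dickey-Fuller: treat the series as stationary at the 99% level when the
# test statistic is below the 1% critical value.
result = adfuller(diffed)
print(f"test statistic: {result[0]:.3f}")
print(f"1% critical value: {result[4]['1%']:.3f}")

# Fit ARIMA with an illustrative (p, d, q) order and compute in-sample MSE.
model = ARIMA(series, order=(1, 1, 1)).fit()
pred = model.predict(start=1, end=len(series) - 1)
mse = ((series.iloc[1:] - pred) ** 2).mean()
print(f"in-sample MSE: {mse:.3f}")
```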
-
Dataset
-
Notebook
-
Implementation Points
- Explored univariate and multivariate time series
- Visualized datasets to understand the components of a time series
- Trend
- Deterministic Trends
- Stochastic Trends
- Seasonality
- Cyclic Patterns
- Noise
- Models for Decomposition of TimeSeries
- Additive Model
- Additive Decomposition
- Multiplicative Model
- Multiplicative Decomposition
- Visualized the Seasonality in datasets
- TimeSeries Forecasting Techniques
- Moving Average
- Centred Moving Average
- Trailing Moving Average
- Handling Missing Values
- Forecasting Requirements
- Outliers
- Resampling
- Up-sampling
- Down-sampling
- Measuring Accuracy
- Mean Absolute Error
- Mean Absolute Percentage Error
- Mean Squared Error
- Root Mean Square Error
- ETS ( Error , Trend, Seasonality ) Models
- SES
- Simple smoothing with additive errors
- Holt
- Holt's linear method with additive errors
- Double Exponential Smoothing
- Holt's linear method with additive errors
- Holt-Winters
- Holt Winter's linear method with additive errors
- Multi-step forecast
- Holt Winter's linear method with additive errors
- Auto Regressive Models
- Auto-Correlation function ( ACF ), Partial Auto-Correlation function ( PACF )
- Stationarity check using Dickey Fuller Test
- ARIMA Model ( AutoRegressive Integrated Moving Average )
- Auto ARIMA
- using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) for model selection
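To tie the decomposition and ETS points above together, here is a minimal statsmodels sketch on a monthly series; the AirPassengers-style layout is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed layout: a monthly series such as AirPassengers, with "Month"
# and "Passengers" columns.
series = pd.read_csv("airpassengers.csv", parse_dates=["Month"],
                     index_col="Month")["Passengers"]

# Additive decomposition: observed = trend + seasonality + residual.
seasonal_decompose(series, model="additive", period=12).plot()
plt.show()

# Holt-Winters with additive trend and seasonality; multi-step forecast.
hw = ExponentialSmoothing(series, trend="add", seasonal="add",
                          seasonal_periods=12).fit()
print(hw.forecast(steps=24).head())
```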
-
Dataset
-
Notebook
-
Implementation Points
- The dataset provides audio information for almost 600k Spotify tracks with 20 features
- Different visualizations, including a WordCloud and bar plots, surfaced the most popular artists, the number of songs per year, the most popular songs, etc.
- Histograms and boxplots showed the skewness of the features in the dataset
- Added a new feature marking a song as highly popular when its popularity is greater than 50, then rebalanced the classes using RandomOverSampler
- Built a per-column preprocessing pipeline (sketched after this list)
- duration: SimpleImputer, FunctionTransformer, RobustScaler
- categorical columns: SimpleImputer, OneHotEncoder
- numerical columns: SimpleImputer, RobustScaler
- Used this pipeline with LogisticRegression, RandomForestClassifier, and XGBClassifier, visualizing the confusion matrix for each of them as a heatmap
- The most important features in each model turned out to be the explicit and loudness columns of the dataset.
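A sketch of that per-column pipeline using scikit-learn's ColumnTransformer; the exact column lists are assumptions based on the dataset's usual schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler

preprocess = ColumnTransformer([
    # duration: impute, convert ms to minutes, then scale robustly
    ("duration", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("to_minutes", FunctionTransformer(lambda x: x / 60000.0)),
        ("scale", RobustScaler()),
    ]), ["duration_ms"]),
    # categorical: impute the most frequent value, then one-hot encode
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["key", "mode", "explicit"]),
    # remaining numerical columns: impute, then scale robustly
    ("numerical", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), ["danceability", "energy", "loudness", "tempo"]),
])

# The same preprocessing feeds each classifier; LogisticRegression shown here.
clf = Pipeline([("prep", preprocess),
                ("model", LogisticRegression(max_iter=1000))])
```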
-
Dataset
-
Notebook
-
Implementation Points
- The dataset describes the listing activity and metrics in NYC for 2019
- Used folium to visualize geographic locations on an interactive map
- Filled NaN values using KNNImputer
- Analyzed categorical and numerical variables by plotting countplots of them
- Visualized the distribution by neighbourhood group, which showed Manhattan to be the priciest neighbourhood_group for Entire home/apt listings
- Analyzed outliers and replaced them with thresholds calculated from the first and third quartiles (sketched after this list)
- Added features and predicted on the data using different models
- The R² score, mean absolute error, mean squared error, and root mean squared error showed CatBoostRegressor performing best, even after hyperparameter optimization
- Visualizing the feature importances showed minimum_nights, annual_income, and total_cost as the top three drivers of the model's output
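A minimal sketch of the quartile-based outlier capping; the file name matches the usual Kaggle export, and targeting the price column is an assumption:

```python
import pandas as pd

def cap_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Clip values outside the 1.5 * IQR whiskers to the whisker values."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[column] = df[column].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return df

listings = pd.read_csv("AB_NYC_2019.csv")
listings = cap_outliers(listings, "price")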
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset and visualized each column
- Analyzed negative values to understand the dataset and added new features
- Detected outliers using scatterplots and quantiles
- Visualized UnitPrice, Quantity, and the derived Sales feature
- Cleaned the data for modeling and bucketized UnitPrice, Quantity, and the dates (sketched after this list)
- Scaled the features and tested the data on different models:
- Linear Regression
- DecisionTree Regressor
- RandomForest Regressor
- Calculated the mean absolute error, mean squared error, and R² score for each of them.
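A sketch of the bucketizing step with pandas; the bin edges and file name are illustrative assumptions, not the notebook's exact values:

```python
import pandas as pd

# Assumed layout: the Online Retail schema with InvoiceDate, UnitPrice,
# and Quantity columns.
retail = pd.read_csv("online_retail.csv", parse_dates=["InvoiceDate"])
retail["Sales"] = retail["UnitPrice"] * retail["Quantity"]

# Fixed-width buckets for UnitPrice and Quantity.
retail["PriceBucket"] = pd.cut(retail["UnitPrice"],
                               bins=[0, 1, 5, 10, 50, float("inf")],
                               labels=["<1", "1-5", "5-10", "10-50", "50+"])
retail["QuantityBucket"] = pd.cut(retail["Quantity"],
                                  bins=[0, 1, 5, 20, float("inf")],
                                  labels=["single", "few", "many", "bulk"])

# Month buckets for the invoice dates.
retail["Month"] = retail["InvoiceDate"].dt.to_period("M")
```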
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset and cleaned it for ease of use.
- Used plotly.express and plotly.graph_objects to visualize player positions with respect to various attributes
- Dropped highly correlated columns to improve the analysis (sketched after this list)
- Modeled the data using:
- Linear Regression
- K-Nearest Neighbors (KNN)
- Decision Tree Regressor
- RandomForest Regressor
- Calculated the R² score, which showed Linear Regression giving the best value of all
- Visually compared predicted vs. actual points
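A minimal sketch of the correlation-based column drop; the file name and the 0.9 threshold are assumptions:

```python
import numpy as np
import pandas as pd

players = pd.read_csv("players.csv")  # hypothetical file name
corr = players.select_dtypes("number").corr().abs()

# Keep only the upper triangle so each pair is considered once, then drop
# one column from every pair whose correlation exceeds the threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
players = players.drop(columns=to_drop)
```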
-
Dataset
-
Notebook
-
Implementation Points
- Explored the dataset and visualized missing values
- Used plotly.express to visualize (see the sketch after this list):
- the countries most represented
- player appearances
- players' ages
- Used plotly.subplots to visualize player stats by playing position
- Used plotly.graph_objects to plot how goals were scored and the mistakes players made
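A minimal plotly.express sketch for the countries-most-represented chart; the file and column names are assumptions:

```python
import pandas as pd
import plotly.express as px

players = pd.read_csv("players.csv")  # hypothetical file and column names
counts = players["Country"].value_counts().head(20).reset_index()
counts.columns = ["Country", "Players"]

fig = px.bar(counts, x="Country", y="Players",
             title="Countries most represented")
fig.show()
```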
-
Dataset
-
Notebook
-
Implementation Points
- We started by exploring the dataset, which gave us an idea of the attributes present in it.
- We then made some changes and visualized the following:
- Carat
- Cut
- Color
- Clarity
- Depth
- Dimensions
- along with their comparison with Price.
- Introduced a new feature, Volume, to see the relationship between a diamond's volume and its price.
- Divided the dataset into train and test sets to evaluate different algorithms (evaluation loop sketched after this list), including:
- Linear Regression
- Lasso Regression
- AdaBoost Regression
- Ridge Regression
- GradientBoosting Regression
- RandomForest Regression
- KNeighbors Regression
- which gave us the R² values (coefficient of determination); visualizing them showed RandomForest Regressor achieving the highest R².
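A sketch of the train/test evaluation loop over a few of these regressors, simplified to numeric columns only; the standard diamonds.csv layout is assumed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

diamonds = pd.read_csv("diamonds.csv")
X = diamonds.select_dtypes("number").drop(columns=["price"])
y = diamonds["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit each model on the train split and report R² on the test split.
models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))
```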
-
Dataset
-
Notebook
-
Implementation Points
- Reviewed the provided train and test data
- Calculated the survival rate of men and women
- Predicted survival using RandomForestClassifier (sketched below)
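A minimal sketch in the style of the Kaggle Titanic starter, assuming the standard train.csv/test.csv files:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Survival rate of women vs. men.
print(train.groupby("Sex")["Survived"].mean())

# A few basic features, one-hot encoded where categorical.
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, train["Survived"])
predictions = model.predict(X_test)
```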