Data is always amazing. If we process it properly, we can obtain great insights. This repository records my day-to-day activities and learnings on my Data Analytics and Machine Learning journey. Before recording my experience, here are the skills I already have in Data Analytics and ML:
- Python
- Hadoop MapReduce (with both Java and Python)
- Working with Spark RDDs and PySpark
- Elastic MapReduce (AWS)
Learning the math behind the algorithms is an essential part of learning Machine Learning.
I started reading the book Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To get a visual representation of the topics, I also watched the Essence of Linear Algebra series by 3Blue1Brown.
As the saying often attributed to Albert Einstein goes, "If you can't explain it simply, you don't understand it well enough." With that in mind, I wrote an article explaining the topics I learned today in a simple manner.
Article Link:
Topics covered in the article:
- Scalars
- Vectors
- Basis vectors
- Span of vectors
- Linear Dependency
- Matrix
- Matrix multiplication
- Inverse of matrix
- Matrix Transpose
- Inner Products
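A few of the topics above can be tried out directly in NumPy. The following is a minimal sketch, not taken from the article; the vectors and matrices are made-up values chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Two vectors in R^2 (illustrative values, not from the article).
v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])

# Inner (dot) product: 1*3 + 2*4 = 11
inner = float(np.dot(v, w))

# A 2x2 matrix with linearly independent columns, so it is invertible.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Matrix-vector multiplication: [2*1 + 1*2, 0*1 + 3*2] = [4, 6]
Av = A @ v

# Transpose swaps rows and columns.
At = A.T

# Inverse: A @ A_inv should be (numerically) the identity matrix.
A_inv = np.linalg.inv(A)
is_identity = bool(np.allclose(A @ A_inv, np.eye(2)))

# Linear dependence: the second column of B is twice the first,
# so the columns span only a line and the rank is 1.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_B = int(np.linalg.matrix_rank(B))

print(inner, Av, is_identity, rank_B)
```

A rank lower than the number of columns is one quick way to spot linearly dependent vectors, and such a matrix has no inverse.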
I also enrolled in the Machine Learning course taught by Andrew Ng. I completed the first week's material today and converted my learning notes into an article.
Article Link:
Topics learned today and covered in the article:
- Univariate Linear Regression
- Cost/Loss function (with one variable)
- Gradient Descent (with one variable)
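To make these three topics concrete, here is a minimal sketch of univariate linear regression trained with batch gradient descent. It assumes the usual hypothesis h(x) = theta0 + theta1 * x and the squared-error cost J = (1 / 2m) * sum((h(x) - y)^2). The data, the learning rate `alpha`, and the iteration count are illustrative choices, not values from the course or the article.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) for one input variable."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent with a simultaneous update of both parameters."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        # Both gradients were computed from the old parameters, so this
        # tuple assignment is a simultaneous update.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data generated from y = 1 + 2x, so the fit can be checked by eye.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1, cost(theta0, theta1, xs, ys))
```

If `alpha` is too large the cost grows instead of shrinking, so printing the cost every few hundred iterations is a simple sanity check.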
I enrolled in an Introduction to Big Data course today to learn why we need big data and how a big data strategy is formed. I completed the course and converted my personal notes into a Medium article.
Article Link:
Topics covered in the article:
- Introduction to Big Data
- Who needs Big data solutions
- Building a Big Data strategy for organizations
- Steps in the Data Science process
I learned about Multivariate Linear Regression today and completed Week 2 of the Machine Learning course taught by Andrew Ng. I have also written a Medium article on Multivariate Linear Regression based on my understanding.
Article Link:
Topics covered in the article:
- Multivariate Linear Regression
- Gradient Descent for multiple variables
- How to check if gradient descent is working properly?
- Feature Scaling
- Mean Normalization
- Features and Polynomial Regression
- Normal Equation
- Normal Equation and Non-invertibility
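The topics above fit together in a short NumPy sketch: mean normalization, batch gradient descent for several features, and the normal equation as a cross-check. The data is made up (the targets are generated from a known linear rule so the result can be verified), and the helper names are hypothetical. Using `np.linalg.pinv` instead of a plain inverse is my assumption about a reasonable way to stay safe when X^T X is not invertible.

```python
import numpy as np

# Made-up data: two features on very different scales.
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0],
                  [1416.0, 2.0],
                  [3000.0, 4.0]])
# Targets generated from a known linear rule so the fit can be checked.
y = 50.0 + 0.1 * X_raw[:, 0] + 20.0 * X_raw[:, 1]


def mean_normalize(X):
    """Rescale each column to zero mean and unit range."""
    mu = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / spread


def add_intercept(X):
    """Prepend a column of ones for the intercept term theta_0."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def normal_equation(X, y):
    """Closed-form least squares: theta = pinv(X^T X) X^T y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y


def gradient_descent(X, y, alpha=0.5, iterations=5000):
    """Batch gradient descent; also returns the cost after each step."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iterations):
        error = X @ theta - y
        theta = theta - alpha * (X.T @ error) / m
        costs.append(float(np.sum((X @ theta - y) ** 2) / (2 * m)))
    return theta, costs


X = add_intercept(mean_normalize(X_raw))
theta_ne = normal_equation(X, y)
theta_gd, costs = gradient_descent(X, y)

print("normal equation :", theta_ne)
print("gradient descent:", theta_gd)
print("cost first/last :", costs[0], costs[-1])
```

Plotting or printing `costs` is one way to check that gradient descent is working: the values should fall steadily, and both methods should end up with nearly the same parameters.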
I also implemented a Univariate Regression Model using the Salary Dataset to predict the salary of employees.
Implementation Link (contains Dataset and Jupyter Notebook):
Topics learned and implemented:
- NumPy
- Matplotlib
- Pandas
- Exploratory Data Analysis
- Model building (finding the coefficients that minimize the cost, predicting the outcome)
- Evaluation techniques (Mean Squared Error, Accuracy)
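The model-building and evaluation steps can be sketched in a few lines of NumPy. This is not the notebook's code: the numbers below are hypothetical stand-ins for the Salary Dataset, the fit uses `np.polyfit` rather than a hand-written optimizer, and R² is used as one common stand-in for "accuracy" in regression, which is my assumption. For brevity the model is scored on the same points it was fitted on; a real evaluation would hold out test data.

```python
import numpy as np

# Hypothetical (years of experience, salary) pairs, not the real Salary Dataset.
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
salary = np.array([40000.0, 43000.0, 50000.0, 52000.0, 59000.0, 62000.0])

# Degree-1 least-squares fit; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(years, salary, deg=1)


def predict(x):
    """Predicted salary for x years of experience."""
    return intercept + slope * x


def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


predictions = predict(years)
mse = mean_squared_error(salary, predictions)
r2 = r_squared(salary, predictions)
print(slope, intercept, mse, r2)
```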
Today I started reading O'Reilly's Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee. The book covers Spark concepts in depth, along with their implementation in PySpark.
Topics learned and implemented on a big data set today:
- Basic Spark concepts: Transformation, Action, Lazy Evaluation
- Spark APIs
- Built-in data sources: Spark SQL tables and views, Spark DataFrames
I also worked on a Kaggle problem to analyse a dataset of one million Kaggle questions.
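The idea behind transformations, actions, and lazy evaluation can be illustrated without a cluster. The sketch below is only an analogy built from plain Python generators, not Spark's API: `lazy_map` and `lazy_filter` are hypothetical helpers that play the role of transformations, and `list(...)` plays the role of an action.

```python
log = []  # records when an element is actually read


def read_numbers(values):
    """Source that notes each read, so we can see when work happens."""
    for v in values:
        log.append(f"read {v}")
        yield v


def lazy_map(fn, items):
    # Like a transformation: describes the work but does not run it.
    return (fn(x) for x in items)


def lazy_filter(pred, items):
    # Also like a transformation: nothing is evaluated yet.
    return (x for x in items if pred(x))


# Build the pipeline. No element has been read at this point.
pipeline = lazy_filter(lambda x: x >= 30,
                       lazy_map(lambda x: x * 10, read_numbers([1, 2, 3, 4])))
reads_before_action = len(log)

# Like an action: asking for the result forces the whole pipeline to run.
result = list(pipeline)
reads_after_action = len(log)

print(reads_before_action, result, reads_after_action)
```

In Spark the same separation lets the engine see the full chain of transformations before running anything, which is what makes it possible to optimize the plan.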
Here is the Kaggle Link:
I have also uploaded the file to Kaggle:
Over the past six days, I worked through DataStax's Apache Cassandra Developer path curriculum and completed the courses below.
- DS101: Introduction to Apache Cassandra
- DS201: Foundations of Apache Cassandra™ and DataStax Enterprise
- DS220: Data Modeling with Apache Cassandra™ and DataStax Enterprise
These courses are made up of video lessons and hands-on exercises that helped me implement different use cases. It was an exciting journey to learn and apply Cassandra's data modeling to several of them. In the relational world, we focus on normalization, whereas Cassandra tables are designed around the queries (the workflow of the application). Because each query is served by a table keyed for it, a read that supplies the partition key can be routed to its partition by hashing the key, which is why partition lookups are often described as O(1).
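The query-first idea can be sketched in plain Python. This is an analogy only, not Cassandra's storage engine or a driver API: a dictionary stands in for hash-based partition lookup, a sorted list stands in for clustering order, and the table and function names are hypothetical.

```python
from bisect import insort
from collections import defaultdict

# One "table" per query. The query here is:
#   "all videos uploaded by a given user, newest first".
# Partition key = user_id; clustering key = upload time, stored negated
# so that the newest entry sorts first.
videos_by_user = defaultdict(list)


def insert_video(user_id, uploaded_at, title):
    """Write a row into the partition for user_id, keeping it sorted."""
    insort(videos_by_user[user_id], (-uploaded_at, title))


def videos_for(user_id):
    """Read one partition: a single hash lookup, with no joins or scans."""
    return [(-neg_ts, title) for neg_ts, title in videos_by_user.get(user_id, [])]


insert_video("alice", 100, "intro")
insert_video("bob", 150, "demo")
insert_video("alice", 300, "deep dive")
insert_video("alice", 200, "follow-up")

print(videos_for("alice"))
```

A second query, such as "all videos with a given tag", would get its own table keyed by tag, with the same rows written again. That duplication is the denormalization trade-off that query-driven modeling accepts.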
I have completed the Apache Cassandra 3 Developer Associate Certification, which is a proctored online exam conducted by DataStax.