30DaysOfData

Data is always amazing. Processed properly, it can yield great insights. This repository records my day-to-day activities and learnings on my Data Analytics and Machine Learning journey. Before I start recording my experience, here are the skills I already have in Data Analytics and ML:

Python
Hadoop MapReduce (with both Java and Python)
Working with Spark RDDs and PySpark
Elastic MapReduce (AWS)

Day 1

Learning the math behind the algorithms is an essential part of learning Machine Learning.

I started reading the book Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To get a visual picture of the topics, I also watched the Essence of Linear Algebra series by 3Blue1Brown (https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab).

As a quote often attributed to Albert Einstein goes, "If you can't explain it simply, you don't understand it well enough." With that in mind, I wrote an article that explains the topics I learned today in simple terms.

Article Link: https://swaathi317.medium.com/math-for-machine-learning-part-1-582419c00932

Topics covered in the article:

Scalars
Vectors
- Basis vectors
- Span of vectors
- Linear Dependency
Matrix
- Matrix multiplication
- Inverse of matrix
- Matrix Transpose
Inner Products
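
To make a few of these topics concrete, here is a minimal sketch of the inner product, transpose, matrix multiplication, and the inverse of a 2x2 matrix. It is written in plain Python so it runs without dependencies; the function names are my own for illustration, and in practice NumPy would handle all of this.

```python
def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def transpose(m):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Matrix product: entry (i, j) is the dot product of row i of a and column j of b."""
    b_columns = transpose(b)
    return [[dot(row, col) for col in b_columns] for row in a]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix using the determinant formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        # A zero determinant means the columns are linearly dependent.
        raise ValueError("matrix is singular, so it has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[2.0, 1.0], [1.0, 1.0]]
A_inv = inverse_2x2(A)
identity = matmul(A, A_inv)  # multiplying a matrix by its inverse gives the identity
```

Multiplying `A` by `A_inv` returning the identity matrix is a quick sanity check that the inverse is right.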

I also enrolled in the Machine Learning course taught by Andrew Ng (https://www.coursera.org/learn/machine-learning) and completed the first week's material today. I converted my learning notes into an article as well.

Article Link: https://swaathi317.medium.com/linear-regression-with-one-variable-b5f59f92ab22

Topics learned today and covered in the article:

Univariate Linear Regression
Cost/Loss function (with one variable)
Gradient Descent (with one variable)
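
The three topics above fit together in a few lines. The following is a minimal sketch, not the course's reference code: it fits a line `theta0 + theta1 * x` with batch gradient descent on a tiny made-up dataset. The learning rate, iteration count, and data are assumptions chosen so the example converges.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for one feature, updating both parameters together."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update: both gradients use the old parameter values.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Made-up data that lies exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

With data that lies exactly on a line, the fitted parameters should approach the intercept and slope of that line and the cost should approach zero.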

Day 2

I enrolled in an Introduction to Big Data course (https://www.coursera.org/learn/big-data-introduction) today to learn why we need big data and how a big data strategy is formed. I completed the course and converted my personal notes from it into a Medium article.

Article Link: https://swaathi317.medium.com/big-data-an-introduction-b7bc048081c9

Topics covered in the article:

Introduction to Big Data
Who needs Big data solutions
Building a Big Data strategy for organizations
Steps in the Data Science process

Day 3

I learned about Multivariate Linear Regression today and completed Week 2 of the Machine Learning course taught by Andrew Ng (https://www.coursera.org/learn/machine-learning). I also wrote a Medium article on Multivariate Linear Regression based on my understanding.

Article Link: https://swaathi317.medium.com/multivariate-linear-regression-1c06b12cb982

Topics covered in the article:

Multivariate Linear Regression
Gradient Descent for multiple variables
- How to check if gradient descent is working properly?
- Feature Scaling
- Mean Normalization
- Features and Polynomial Regression
Normal Equation
- Normal Equation and Non-invertibility
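
As a rough illustration of two items in this list, the sketch below shows mean normalization of a feature column and the normal equation, theta = (X^T X)^-1 X^T y, solved with a small Gauss-Jordan routine. The helper names and the data are hypothetical, and plain Python is used only to keep the example dependency-free; NumPy would be the usual choice.

```python
def mean_normalize(column):
    """Rescale a feature to (value - mean) / (max - min); assumes max != min."""
    mean = sum(column) / len(column)
    spread = max(column) - min(column)
    return [(v - mean) / spread for v in column]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Happens when features are redundant (linearly dependent).
            raise ValueError("X^T X is non-invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def normal_equation(features, ys):
    """Return theta that minimizes squared error, with an added intercept column."""
    X = [[1.0] + list(row) for row in features]
    cols = list(zip(*X))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * y for p, y in zip(ci, ys)) for ci in cols]
    return solve(xtx, xty)


# Made-up data generated from y = 1 + 2*x1 + 3*x2.
features = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [3.0, 4.0], [4.0, 2.0]]
ys = [6.0, 8.0, 9.0, 19.0, 15.0]
theta = normal_equation(features, ys)
scaled = mean_normalize([1.0, 2.0, 3.0])
```

Unlike gradient descent, the normal equation needs no learning rate or feature scaling, but it fails when X^T X is non-invertible, for example when one feature duplicates another.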

I also implemented a Univariate Regression Model using the Salary Dataset (https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression), in order to predict the salary of employees.

Implementation Link (contains Dataset and Jupyter Notebook): https://github.com/swaathi317/30DaysOfData/tree/main/Univariate%20Linear%20Regression

Topics learned and implemented:

NumPy
Matplotlib
Pandas
Exploratory Data Analysis
Model building (finding the coefficients that minimize the cost, predicting the outcome)
Evaluation techniques (Mean Squared Error, Accuracy)
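
For the evaluation step, a minimal sketch of two common regression metrics follows. Mean Squared Error is named above; reading "Accuracy" as the R^2 score is my assumption, since that is what regression models typically report as a score. The numbers are made up and are not taken from the Salary Dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; 1.0 means a perfect fit."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```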

Day 4

Today I started reading O'Reilly's Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee (https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf). The book covers Spark concepts in depth, along with their implementation in PySpark.

Topics learned and implemented on a big data set today:

Basic Spark concepts: Transformation, Action, Lazy Evaluation
Spark APIs
Built-in data sources: Spark SQL tables and views, Spark DataFrames
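
The idea of lazy evaluation can be illustrated without a Spark cluster. The sketch below is a plain-Python analogy using generators, not the Spark API: the generator expressions play the role of transformations, which only describe work, and `list(...)` plays the role of an action, which triggers it. All names here are made up for the illustration.

```python
log = []  # records when data is actually read


def read_numbers():
    """Stand-in for a data source; logs each element as it is produced."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


# "Transformations": these lines only build a pipeline, nothing is read yet.
squared = (n * n for n in read_numbers())
evens = (n for n in squared if n % 2 == 0)
reads_before_action = len(log)

# "Action": consuming the pipeline forces every step to run.
result = list(evens)
reads_after_action = len(log)
```

No element is read until the final `list(...)` call, which mirrors how Spark defers transformations until an action such as `collect()` or `count()` is invoked.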

I also worked on a Kaggle problem (https://www.kaggle.com/pavellexyr/one-million-reddit-questions) to analyse a dataset of one million Reddit questions.

Here is the Kaggle Link: https://www.kaggle.com/swaathis/factors-that-make-a-question-better

I have also uploaded the notebook to GitHub - https://github.com/swaathi317/30DaysOfData/blob/main/DataAnalysis/What%20makes%20a%20good%20reddit%20question.ipynb

Day 5, Day 6, Day 7, Day 8, Day 9, Day 10

Over the past six days, I worked through DataStax's Apache Cassandra Developer path curriculum (https://lnkd.in/gb5NX2h6) and completed the courses below.

DS101: Introduction to Apache Cassandra
DS201: Foundations of Apache Cassandra™ and DataStax Enterprise
DS220: Data Modeling with Apache Cassandra™ and DataStax Enterprise

These courses are made up of video lessons and hands-on exercises that helped me implement different use cases. It was an exciting journey to learn Cassandra's data modeling and apply it to several of them. In the relational world, we usually focus on normalization. In Cassandra, tables are instead designed around the queries (the workflow of the application), so a read can go straight to the partition that holds the data using its partition key, which gives roughly O(1) lookup.
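
The query-first idea can be sketched with a toy analogy in plain Python. This is not Cassandra or CQL; dictionaries stand in for tables, the dictionary key stands in for the partition key, and the table and field names are hypothetical. Each query gets its own denormalized "table", so every read is a single key lookup.

```python
from collections import defaultdict

# One "table" per query, keyed by what the query filters on.
videos_by_user = defaultdict(list)  # query: all videos uploaded by a user
videos_by_tag = defaultdict(list)   # query: all videos carrying a tag


def add_video(user_id, title, tags):
    """Write the same video into every table that serves a query (denormalization)."""
    videos_by_user[user_id].append(title)
    for tag in tags:
        videos_by_tag[tag].append(title)


add_video("alice", "Intro to Cassandra", ["cassandra", "nosql"])
add_video("bob", "Spark Basics", ["spark"])
add_video("alice", "Data Modeling", ["cassandra"])

# Each read is one lookup by key, with no join or scan.
alice_videos = videos_by_user["alice"]
cassandra_videos = videos_by_tag["cassandra"]
```

The trade-off is duplicated data and extra work on every write in exchange for fast, predictable reads.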

I have also earned the Apache Cassandra 3 Developer Associate Certification by passing a proctored online exam conducted by DataStax.
