30DaysOfData

Data is always amazing. If we process it properly, we can obtain great insights. This repository contains the day-to-day activities and learnings of my Data Analytics and Machine Learning journey. Before recording my experience, here are the skills I already have with respect to Data Analytics and ML:

  • Python
  • Hadoop MapReduce (with both JAVA and Python)
  • Working with Spark RDDs and PySpark
  • Elastic MapReduce (AWS)

Day 1

Learning the math behind the algorithms is an essential part of learning Machine Learning.

I started reading the book Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To get a visual representation of the topics, I also watched the Essence of Linear Algebra series by 3Blue1Brown (https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab).

As the quote often attributed to Albert Einstein goes, "If you can't explain it simply, you don't understand it well enough." With that in mind, I have written an article explaining the topics I learned today in simple terms.

Article Link: https://swaathi317.medium.com/math-for-machine-learning-part-1-582419c00932

Topics covered in the article:

  1. Scalars
  2. Vectors
    • Basis vectors
    • Span of vectors
    • Linear Dependency
  3. Matrix
    • Matrix multiplication
    • Inverse of matrix
    • Matrix Transpose
  4. Inner Products
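As a quick illustration of the operations listed above, here is a minimal plain-Python sketch (not taken from the article). The helper names `inner`, `transpose`, `matmul`, and `inverse_2x2` are made up for this example, and matrices are assumed to be lists of rows.

```python
def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def transpose(m):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product: entry (i, j) is the inner product of row i of a and column j of b."""
    cols = transpose(b)
    return [[inner(row, col) for col in cols] for row in a]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix; it does not exist when the columns are linearly dependent."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix: its columns are linearly dependent")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[2, 1], [1, 1]]
A_inv = inverse_2x2(A)
identity = matmul(A, A_inv)  # multiplying a matrix by its inverse gives the identity
```

A matrix such as `[[1, 2], [2, 4]]` has linearly dependent columns, so `inverse_2x2` raises an error for it instead of returning a result.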

I also enrolled in the Machine Learning course taught by Andrew Ng (https://www.coursera.org/learn/machine-learning). I completed the first week's material today and converted my learning notes into an article.

Article Link: https://swaathi317.medium.com/linear-regression-with-one-variable-b5f59f92ab22

Topics learned today and covered in the article:

  1. Univariate Linear Regression
  2. Cost/Loss function (with one variable)
  3. Gradient Descent (with one variable)
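The three topics above fit together in a few lines of code. The following is a minimal sketch in plain Python, not the code from the article: the function names, the learning rate `alpha`, the step count, and the toy data are all assumptions chosen for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2) with h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Batch gradient descent for one feature; both parameters are updated simultaneously."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should approach theta0 = 1, theta1 = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

If `alpha` is too large the updates overshoot and the cost grows instead of shrinking, so the value here was picked to be small enough for this particular data.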

Day 2

I enrolled in the Introduction to Big Data course (https://www.coursera.org/learn/big-data-introduction) today to understand why we need big data and how a big data strategy is formed. I completed the course and converted my personal notes from it into a Medium article.

Article Link: https://swaathi317.medium.com/big-data-an-introduction-b7bc048081c9

Topics covered in the article:

  1. Introduction to Big Data
  2. Who needs Big data solutions
  3. Building a Big Data strategy for organizations
  4. Steps in the Data Science process

Day 3

I learned about Multivariate Linear Regression today and completed Week 2 of the Machine Learning course taught by Andrew Ng (https://www.coursera.org/learn/machine-learning). I have also written a Medium article on Multivariate Linear Regression based on my understanding.

Article Link: https://swaathi317.medium.com/multivariate-linear-regression-1c06b12cb982

Topics covered in the article:

  1. Multivariate Linear Regression
  2. Gradient Descent for multiple variables
    • How to check if gradient descent is working properly?
    • Feature Scaling
    • Mean Normalization
    • Features and Polynomial Regression
  3. Normal Equation
    • Normal Equation and Non-invertibility
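Two of the ideas above, mean normalization and the normal equation, can be sketched in plain Python as follows. This is an illustrative sketch rather than the article's code: the helper names and the toy data are assumptions, and a small Gauss-Jordan solver stands in for a library matrix inverse.

```python
def mean_normalize(column):
    """Rescale one feature to (x - mean) / (max - min)."""
    mean = sum(column) / len(column)
    spread = max(column) - min(column)
    return [(x - mean) / spread for x in column]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Non-invertibility: typically redundant (linearly dependent) features.
            raise ValueError("X^T X is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def normal_equation(features, ys):
    """Solve (X^T X) theta = X^T y, where X has a leading column of ones for the intercept."""
    X = [[1.0] + list(row) for row in features]
    n = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(n)]
    return solve(xtx, xty)


# Toy data generated from y = 1 + 2*x1 + 3*x2.
features = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]
theta = normal_equation(features, ys)
scaled_x1 = mean_normalize([row[0] for row in features])
```

The normal equation needs no learning rate and no feature scaling, while gradient descent usually converges faster on features rescaled as in `mean_normalize`.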

I also implemented a Univariate Regression Model on the Salary Dataset (https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression) to predict employee salaries.

Implementation Link (contains Dataset and Jupyter Notebook): https://github.com/swaathi317/30DaysOfData/tree/main/Univariate%20Linear%20Regression

Topics learned and implemented:

  • NumPy
  • Matplotlib
  • Pandas
  • Exploratory Data Analysis
  • Model building (finding the coefficients that minimize the cost, predicting the outcome)
  • Evaluation techniques (Mean Squared Error, Accuracy)

Day 4

Today I started reading O'Reilly's Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee (https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf). The book covers Spark concepts in depth, along with their implementation in PySpark.

Topics learned and implemented on a big data set today:

  • Basic Spark concepts: Transformation, Action, Lazy Evaluation
  • Spark APIs
  • Built-in data sources: Spark SQL tables and views, Spark Dataframes
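To make the difference between transformations, actions, and lazy evaluation concrete, here is a toy model in plain Python. It is deliberately not Spark's API: `LazyDataset` is a hypothetical class written for this note that only mimics the idea that transformations record a plan and an action runs it.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: records a plan and runs it only on an action."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of ("map" or "filter", function) steps

    # Transformations: return a new dataset with a longer plan; no data is touched.
    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._plan + (("filter", fn),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger execution of the recorded plan.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = even_squares.collect()   # the action executes the whole plan
```

Because the plan is only recorded until `collect` is called, `square` is not invoked while the transformations are being chained. Real Spark uses the same deferral to optimize the whole plan before running it.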

I also worked on a Kaggle dataset (https://www.kaggle.com/pavellexyr/one-million-reddit-questions) to analyse one million Reddit questions.

Here is the Kaggle Link: https://www.kaggle.com/swaathis/factors-that-make-a-question-better

I have also uploaded the notebook to GitHub: https://github.com/swaathi317/30DaysOfData/blob/main/DataAnalysis/What%20makes%20a%20good%20reddit%20question.ipynb

Day 5, Day 6, Day 7, Day 8, Day 9, Day 10

Over these six days, I worked through DataStax's Apache Cassandra Developer path curriculum (https://lnkd.in/gb5NX2h6) and completed the courses below.

  1. DS101: Introduction to Apache Cassandra
  2. DS201: Foundations of Apache Cassandra™ and DataStax Enterprise
  3. DS220: Data Modeling with Apache Cassandra™ and DataStax Enterprise


These courses consist of video lessons and hands-on exercises that helped me implement different use cases. It was an exciting journey to learn and apply Cassandra's data modeling. In the relational world, we focus on normalization, whereas Cassandra tables are designed around the queries (the workflow of the application). Because each query is served by a table keyed for it, a read can locate its partition directly by key, which is roughly an O(1) lookup.
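The query-first idea can be sketched with a toy in-memory model in Python. This is only an analogy, not Cassandra or its driver API: `QueryTable`, the two table names, and the sample rows are all hypothetical. A dictionary lookup stands in for locating a partition by its key, and a sorted list stands in for rows ordered by a clustering key.

```python
from bisect import insort
from collections import defaultdict


class QueryTable:
    """Toy model of one table: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        insort(self._partitions[partition_key], (clustering_key, value))

    def read_partition(self, partition_key):
        # A single keyed lookup; no scan over other partitions and no join.
        return [value for _, value in self._partitions.get(partition_key, [])]


# One table per query, with the same data duplicated (denormalized) into each.
videos_by_user = QueryTable()  # serves "show all videos uploaded by a user"
videos_by_tag = QueryTable()   # serves "show all videos with a given tag"


def add_video(user, tags, uploaded_at, title):
    videos_by_user.insert(user, uploaded_at, title)
    for tag in tags:
        videos_by_tag.insert(tag, uploaded_at, title)


add_video("alice", ["spark"], "2021-11-02", "Intro to Spark")
add_video("alice", ["cassandra"], "2021-11-01", "Cassandra basics")
add_video("bob", ["cassandra"], "2021-11-03", "Data modeling")

alice_videos = videos_by_user.read_partition("alice")
cassandra_videos = videos_by_tag.read_partition("cassandra")
```

The trade-off is that every write touches each table that needs the row, in exchange for reads that never have to join or filter across partitions.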

I have completed the Apache Cassandra 3 Developer Associate Certification, which is a proctored online exam conducted by DataStax.

Cassandra developer certification