Data is always amazing. If we process it properly, we can obtain great insights. This repository contains my day-to-day activities and learnings from my Data Analytics and Machine Learning journey. Before recording my experience, here are the skills I already possess with respect to Data Analytics and ML:
- Python
- Hadoop MapReduce (with both Java and Python)
- Working with Spark RDDs and PySpark
- Elastic MapReduce (AWS)
Learning the math behind the algorithms is an essential part of learning Machine Learning.
I started reading the book Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To get a visual representation of the topics, I also watched the Essence of Linear Algebra series by 3Blue1Brown (https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab).
As a quote often attributed to Albert Einstein goes, "If you can't explain it simply, you don't understand it well enough." So I have written an article explaining, in simple terms, the topics I learned today.
Article Link: https://swaathi317.medium.com/math-for-machine-learning-part-1-582419c00932
Topics covered in the article:
- Scalars
- Vectors
- Basis vectors
- Span of vectors
- Linear Dependency
- Matrix
- Matrix multiplication
- Inverse of matrix
- Matrix Transpose
- Inner Products
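A minimal NumPy sketch of a few of the topics above (inner product, transpose, inverse, matrix multiplication, and linear dependence). The vectors and matrices are made-up values for illustration and are not taken from the article.

```python
import numpy as np

# Two example vectors in R^2 (illustrative values).
v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])

# Inner (dot) product: 1*3 + 2*4
inner = float(np.dot(v, w))

# A 2x2 matrix with linearly independent columns, so it has an inverse.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_T = A.T                 # matrix transpose
A_inv = np.linalg.inv(A)  # inverse of the matrix
product = A @ A_inv       # matrix multiplication; close to the identity

# Linear dependence: the second row is 2x the first, so the rank drops to 1.
dependent = np.array([[1.0, 2.0],
                      [2.0, 4.0]])
rank = int(np.linalg.matrix_rank(dependent))
```

A rank smaller than the number of vectors means they are linearly dependent, so they span a lower-dimensional space than their count suggests.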
I also enrolled in the Machine Learning course taught by Andrew Ng (https://www.coursera.org/learn/machine-learning). I completed the first week's material today and converted my learning notes into an article.
Article Link: https://swaathi317.medium.com/linear-regression-with-one-variable-b5f59f92ab22
Topics learned today and covered in the article:
- Univariate Linear Regression
- Cost/Loss function (with one variable)
- Gradient Descent (with one variable)
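The three topics above fit together in a few lines of plain Python. This is a sketch under assumptions: the hypothesis is h(x) = theta0 + theta1 * x, the cost is the half mean squared error, and the data points, learning rate `alpha`, and iteration count are made-up values chosen for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Half mean squared error: J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for one feature, updating both parameters together."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update: both gradients use the old parameter values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Illustrative data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

With a learning rate that is too large the cost grows instead of shrinking, so watching `cost` across iterations is a simple sanity check.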
I enrolled in an Introduction to Big Data course (https://www.coursera.org/learn/big-data-introduction) today to learn why we need big data and how a big data strategy is formed. I completed the course and converted my personal notes from it into a Medium article.
Article Link: https://swaathi317.medium.com/big-data-an-introduction-b7bc048081c9
Topics covered in the article:
- Introduction to Big Data
- Who needs Big data solutions
- Building a Big Data strategy for organizations
- Steps in the Data Science process
I learned about Multivariate Linear Regression today and completed Week 2 of the Machine Learning course taught by Andrew Ng (https://www.coursera.org/learn/machine-learning). I have also written a Medium article on Multivariate Linear Regression based on my understanding.
Article Link: https://swaathi317.medium.com/multivariate-linear-regression-1c06b12cb982
Topics covered in the article:
- Multivariate Linear Regression
- Gradient Descent for multiple variables
- How to check if gradient descent is working properly?
- Feature Scaling
- Mean Normalization
- Features and Polynomial Regression
- Normal Equation
- Normal Equation and Non-invertibility
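A short NumPy sketch of two of the topics above: mean normalization with feature scaling, and the normal equation. The data is made up so that y = 4 + 3*x1 + 2*x2 holds exactly. Dividing by the standard deviation is one common scaling choice (the feature range is another), and using the pseudo-inverse is an assumption made here so the computation still works when X^T X is not invertible.

```python
import numpy as np

# Illustrative data: two features, target generated as y = 4 + 3*x1 + 2*x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0]])
y = 4.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1]

# Mean normalization and feature scaling: each column ends up with
# mean 0 and standard deviation 1, which helps gradient descent converge.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

# Normal equation: theta = (X^T X)^(-1) X^T y, with a column of ones
# prepended for the intercept. No feature scaling is needed for this route.
X_design = np.column_stack([np.ones(len(X)), X])
theta = np.linalg.pinv(X_design.T @ X_design) @ X_design.T @ y
```

Here `theta` recovers the intercept and the two coefficients in one step, without a learning rate or iterations, at the cost of solving a linear system whose size grows with the number of features.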
I also implemented a univariate regression model using the Salary Dataset (https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression) to predict the salary of employees.
Implementation Link (contains Dataset and Jupyter Notebook): https://github.com/swaathi317/30DaysOfData/tree/main/Univariate%20Linear%20Regression
Topics learned and implemented:
- NumPy
- Matplotlib
- Pandas
- Exploratory Data Analysis
- Model building (finding the coefficients that minimize the cost, predicting outcomes)
- Evaluation techniques (Mean Squared Error, Accuracy)
Today I started reading O'Reilly's Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee (https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf). The book covers Spark concepts in depth, along with their implementation in PySpark.
Topics learned and implemented on a big data set today:
- Basic Spark concepts: Transformation, Action, Lazy Evaluation
- Spark APIs
- Built-in data sources: Spark SQL tables and views, Spark DataFrames
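The idea of transformations, actions, and lazy evaluation listed above can be illustrated with a plain-Python analogy. This is not Spark's API: Python generators stand in for transformations, and `list(...)` stands in for an action. The `log` list and the `numbers` function are hypothetical helpers added only to show when work actually happens.

```python
log = []  # records each time a value is actually read


def numbers():
    """Stand-in for a data source; logs every element it produces."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


# "Transformations": these lines only describe the pipeline, nothing runs yet.
evens = (n for n in numbers() if n % 2 == 0)
squared = (n * n for n in evens)
work_before_action = len(log)

# "Action": asking for the results forces the whole pipeline to execute.
result = list(squared)
work_after_action = len(log)
```

In Spark, the same separation lets the engine see the full chain of transformations before running anything, so it can plan the execution as a whole.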
I also worked on a Kaggle problem (https://www.kaggle.com/pavellexyr/one-million-reddit-questions) to analyse a dataset of one million Reddit questions.
Here is the Kaggle Link: https://www.kaggle.com/swaathis/factors-that-make-a-question-better
I have also uploaded the notebook to GitHub: https://github.com/swaathi317/30DaysOfData/blob/main/DataAnalysis/What%20makes%20a%20good%20reddit%20question.ipynb
Over the past six days, I took courses from DataStax's Apache Cassandra Developer path curriculum (https://lnkd.in/gb5NX2h6). I completed the courses below.
- DS101: Introduction to Apache Cassandra
- DS201: Foundations of Apache Cassandra™ and DataStax Enterprise
- DS220: Data Modeling with Apache Cassandra™ and DataStax Enterprise
These courses are made up of video lessons and hands-on exercises that helped me implement Cassandra's data modeling for several use cases, and it was an exciting journey. In the relational world, we always focus on normalization. In Cassandra, tables are instead designed around the queries (the workflow of the application), so each query can be answered from a single partition located by its partition key. This is what lets Cassandra reach data in roughly constant time, O(1).
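The query-first idea can be sketched with a plain-Python analogy. This is not Cassandra or CQL: dictionaries stand in for tables, and dictionary keys stand in for partition keys. The `videos` records and the table names `videos_by_id` and `videos_by_user` are hypothetical examples, not taken from the courses.

```python
from collections import defaultdict

# Hypothetical application data.
videos = [
    {"video_id": 1, "user": "alice", "title": "Intro to Cassandra"},
    {"video_id": 2, "user": "bob", "title": "Data Modeling Basics"},
    {"video_id": 3, "user": "alice", "title": "Partition Keys"},
]

# One "table" per query, keyed by what that query filters on.
# Query 1: find a video by its id.
videos_by_id = {v["video_id"]: v for v in videos}

# Query 2: find all videos uploaded by a user.
# The same video appears in both tables on purpose (denormalization).
videos_by_user = defaultdict(list)
for v in videos:
    videos_by_user[v["user"]].append(v)


def titles_for_user(user):
    """Answer query 2 with one key lookup instead of scanning or joining."""
    return [v["title"] for v in videos_by_user.get(user, [])]


alice_titles = titles_for_user("alice")
video_two = videos_by_id[2]["title"]
```

The trade-off is extra storage and extra writes, since each new query pattern usually means another table holding the same data in a different shape.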
I have completed the Apache Cassandra 3 Developer Associate Certification, which is a proctored online exam conducted by DataStax.