A Self-Taught Data Scientist

Hi, my name is Harrison Jansma.

At the beginning of 2018 I was in a new city at a dead-end job. After some serious self-reflection about my passions and interests, I committed to learning everything I could about machine learning, data science, and the tech industry. I built my own curriculum, a hodgepodge of hundreds of websites and forums, and committed 4-6 hours each day to learning the basics.

Fast forward to today: I am enrolled in an excellent computer science program. I now publish my own blog posts to help others learn about everything from data science to deep learning. I am excited by all the things I have learned over the last few months, and I love that I get to work on something I am passionate about each day.

I still have a long way to go, though. Over the next few months I will focus on teaching others and implementing machine learning models in real-world production systems. I also hope to gain internship experience so I can engage with like-minded people and get more practice building intelligent systems for use in the real world.

Get to Know Me

My Portfolio Website

I built the site in HTML, CSS, and JavaScript using pieces of an existing design on Colorlib. Though I am not interested in front-end development, I created and deployed this website on a private DigitalOcean server so that I could learn more about web app design and back-end development. I believe that knowledge of both data science and back-end development is the secret to a seamless implementation of artificial intelligence in existing web technologies.

In the future, I will use this website as a testing ground for web-based computer vision and NLP models. Though I don't expect it to be anything more than a portfolio site, I strongly suspect that these skills will be crucial to technological development in the years to come.


My Writing

All my life I have been an avid sci-fi and fantasy reader. Very recently I have begun writing about my research, thoughts, and experiences and publishing them for the world to see. Though my creative output is a work in progress, my blog posts about data science and deep learning have been well received.

I have been published in multiple major data science and analytics publications, including freeCodeCamp (500k subscribers), Towards Data Science, and KDnuggets. I have also built my own following of 1,500 subscribers.


My Professional Experience

I am twenty-three years old and enrolled in a master's program, so I have had little industry experience. This lack of access to a production environment has been the largest roadblock in my curriculum. However, I think I have found a solution to this issue that is unique to the tech industry.

Thanks to the availability of cheap computational resources, I have found that it is possible for an individual to create their own production systems. In the last few months, I have deployed servers on DigitalOcean to host my website, build machine learning environments, and house MySQL databases. All of these are practical applications relevant to the day-to-day work of a data scientist or machine learning engineer.

My Work


Medium is a blogging platform where writers and readers share their ideas. The purpose of this project was to give Medium writers a benchmark to measure their own performance, as well as a goal that might increase the rankings of their stories in Medium's recommendation engine. With more than two hundred thousand writers in my dataset, this project has the potential to ease the creative process for thousands and increase the quality of Medium's stories for its readers.

By collecting data on one million Medium stories, I was able to analyze the performance of Medium's articles. As a result of this project, I found that the top 1% of Medium articles receive at least two thousand claps. Authors can use this metric as a goal when writing future stories. By reaching the top 1% of claps, a writer's story is more likely to stand out to Medium's recommendation engine and, as a result, reach new and diverse audiences.
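For readers who want to reproduce that benchmark on their own data, a minimal sketch of the computation is below. It assumes a CSV of scraped stories with a `claps` column; the file and column names are hypothetical placeholders.

```python
# Hedged sketch: compute the clap count that marks the top 1% of stories.
import pandas as pd

stories = pd.read_csv("medium_stories.csv")      # hypothetical dump of scraped stories
top_1_percent = stories["claps"].quantile(0.99)  # 99th percentile of claps
print(f"Top 1% threshold: {top_1_percent:.0f} claps")
```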

The results of my analysis, along with an extensive exploratory data analysis of Medium, can be found in this repository.

I also wrote a story detailing my findings in Medium's largest tech publication, freeCodeCamp (496k subscribers). The full article can be found here. I then published the full dataset for public use by Medium's data science community. All 1.4 million data points are freely available on Kaggle. My introductory article, describing the dataset and how I collected it, can be found here.

October 10, 2018


This experiment tests whether convolutional neural networks perform better on image recognition tasks with dropout or with batch normalization. The notebook in this repository is experimental evidence supporting the Medium post I wrote explaining how to build convolutional neural networks more effectively.

The above blog post has been published and featured in Towards Data Science, with 3K reads on Medium in 2 weeks. It has also been reposted as a guest blog on KDnuggets, a leading site on Analytics, Big Data, Data Science, and Machine Learning, reaching over 500K unique visitors per month and over 230K subscribers/followers via email and social media.
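To make the comparison concrete, here is a minimal Keras sketch of the two regularization choices being tested. It is illustrative only; the architecture, input shape, and hyperparameters are placeholders, not the notebook's exact setup.

```python
# Hedged sketch: two tiny CNNs that differ only in their regularization layer.
from tensorflow.keras import layers, models

def conv_net(use_batchnorm: bool) -> models.Sequential:
    """Build a small CNN regularized with batch norm or with dropout."""
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        layers.BatchNormalization() if use_batchnorm else layers.Dropout(0.25),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Train both variants on the same data and compare validation accuracy.
batchnorm_model = conv_net(use_batchnorm=True)
dropout_model = conv_net(use_batchnorm=False)
```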

August 15, 2018

Object localization featuring my dog, Huckleberry.

In this project, I implemented the deep learning method for object localization (finding objects in an image) proposed in this research paper. I improved code written by Alexis Cook to handle multi-class localization of images.
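As a rough illustration of the family of methods involved: one common global-average-pooling approach is the class activation map (CAM), which weights the final convolutional feature maps by the dense-layer weights of the predicted class. The sketch below is a hedged reconstruction of that idea, not necessarily the exact method from the paper.

```python
# Hedged sketch: a class activation map (CAM) for object localization.
import numpy as np

def class_activation_map(feature_maps: np.ndarray,
                         class_weights: np.ndarray) -> np.ndarray:
    """feature_maps: (H, W, C) output of the final conv layer.
    class_weights: (C,) dense-layer weights for the predicted class.
    Returns an (H, W) heat map showing where the class appears."""
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()  # normalize to [0, 1] for display
    return cam
```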

Computer vision has innumerable real-world applications. This project was my introduction to the world of computer vision research. Since the conclusion of this project, I have focused heavily on researching recent advances in convolutional neural network architectures. Furthermore, I have placed an emphasis on learning how to apply these concepts using TensorFlow and Keras.

July 16, 2018


This project was motivated by my drive to learn the best practices for predictive modeling on text data. In the write-up, I cleaned and vectorized Twitter data, visualized and examined patterns, and created a linear classifier that predicts document sentiment with 89% accuracy on a validation set.
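The sketch below shows the general shape of such a pipeline (TF-IDF features feeding a linear classifier) with scikit-learn. The tiny inline dataset is a placeholder, and the exact preprocessing in the write-up may differ.

```python
# Hedged sketch: TF-IDF vectorization + logistic regression for sentiment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

tweets = ["love this phone", "great service today",
          "terrible experience", "worst support ever"]  # placeholder data
labels = [1, 1, 0, 0]                                   # 1 = positive, 0 = negative

X_train, X_val, y_train, y_val = train_test_split(
    tweets, labels, test_size=0.5, stratify=labels, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```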

In the future, I would like to productionize this NLP model by creating a REST API that gives others access to my predictions.
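A possible shape for that API, using Flask as one common choice; the endpoint and payload are hypothetical, and `model` is assumed to be the fitted pipeline from the sketch above.

```python
# Hedged sketch: a minimal REST endpoint serving sentiment predictions.
# `model` is assumed to be the fitted scikit-learn pipeline from above.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]          # expects {"text": "some tweet"}
    sentiment = int(model.predict([text])[0])  # 1 = positive, 0 = negative
    return jsonify({"sentiment": sentiment})

if __name__ == "__main__":
    app.run()
```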

June 20, 2018


In this project, I used unsupervised learning to cluster forum discussions. Specifically, I performed LDA clustering on Wikipedia forum comments to see if I could isolate clusters of toxic comments (insults, slurs, etc.).

I was successful in isolating toxic comments into one group. Furthermore, I gained valuable knowledge about the discussions held within the forum dataset, labeling forum posts into nine distinct categories. These nine categories could be further grouped as either relevant discussion, side conversations, or outright toxic comments.
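For reference, a minimal scikit-learn sketch of LDA topic clustering is below. The nine-topic setting follows the write-up; the placeholder comments and vectorizer settings are illustrative assumptions.

```python
# Hedged sketch: LDA topic modeling over bag-of-words forum comments.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

comments = ["please cite a source for that claim",
            "you are a complete idiot"]  # placeholder forum comments

counts = CountVectorizer(stop_words="english", max_features=5000)
X = counts.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=9, random_state=0)
topic_weights = lda.fit_transform(X)           # shape: (n_comments, 9)
dominant_topic = topic_weights.argmax(axis=1)  # one topic label per comment
```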

June 13, 2018

In this write-up, I sought to answer whether a survey on the mental health benefits of tech-industry employees could be used to cluster employees into groups with good and bad mental health coverage.

By cleaning the survey data and performing an exploratory data analysis, I was able to analyze the demographics of the tech industry. I found the average respondent was male, aged 35, and located in the United States. By performing KMeans and agglomerative clustering (with scikit-learn), I attempted to cluster the data but found that the survey's design prevented any meaningful insight into the data.

In completing this project, I learned how to encode categorical data and create an insightful EDA with great visualizations. I also learned how to implement clustering methods on data and how to assess a clustering's appropriateness with various techniques.
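A hedged sketch of that encode-then-cluster step is below; the survey columns and answer values are hypothetical, not the actual questionnaire fields.

```python
# Hedged sketch: one-hot encode categorical survey answers, then cluster.
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import OneHotEncoder

survey = pd.DataFrame({
    "benefits":     ["Yes", "No", "Don't know", "Yes"],
    "care_options": ["Yes", "No", "No", "Not sure"],
})  # placeholder survey answers

X = OneHotEncoder().fit_transform(survey).toarray()

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
```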

May 23, 2018


In this write-up, I sought to practice the entire data science lifecycle. This includes defining project end goals, data cleaning, exploratory data analysis, model comparisons, and model tuning.

After a brief EDA, I visualized the Titanic dataset via a 2D projection. I then compared several machine learning algorithms and found the most accurate model to be a gradient boosted machine. After a model-tuning phase, I increased model accuracy from 77% to 79%.
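The compare-then-tune workflow looks roughly like the sketch below. The synthetic features and the hyperparameter grid are placeholders, not the project's exact setup.

```python
# Hedged sketch: cross-validate candidate models, then grid-search the best.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # stand-in features (age, fare, ...)
y = rng.integers(0, 2, size=100)  # stand-in survival labels

candidates = {"logistic": LogisticRegression(max_iter=1000),
              "gbm": GradientBoostingClassifier()}
for name, model in candidates.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())

# Tune the strongest model over a small hyperparameter grid.
grid = GridSearchCV(GradientBoostingClassifier(),
                    {"n_estimators": [100, 300],
                     "learning_rate": [0.05, 0.1],
                     "max_depth": [2, 3]},
                    cv=5)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
```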

May 3, 2018

Mini-Projects
