
Data Science, Engineering and Analytics Projects and Solutions

This portfolio consists of several data science and analytics projects, concepts, tools, and resources illustrating the work I have done to further develop my data science skills.


Table of Contents

1. Python

2. R

3. SAS

4. Visualizations (PowerBI, Tableau, ScatterText)

5. Model Deployment

Projects

Python

Transforming Customer Experiences: How a Fine-Tuned Llama 2 Model Can Empower Product FAQs

Link to Article (with all code)

I detail how I fine-tuned an open-source Llama 2 model on Safaricom's product- and service-related FAQ question-and-answer pairs. Models such as Llama 2 can predict the next token in a sequence, but this predictive ability alone does not make them highly effective virtual assistants, as they do not inherently respond to explicit instructions. To bridge this gap, a technique known as instruction tuning is applied to align their responses more closely with human expectations. I used Supervised Fine-Tuning (SFT) with the FAQ pairs: the model is trained on a dataset of paired instructions and corresponding responses, as in our case with the question-answer pairs. The goal is to optimize the model's internal parameters to minimize the disparity between the generated answers and the ground-truth responses, which serve as reference labels.
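Below is a minimal sketch of this SFT setup using the Hugging Face trl library. This is an assumption for illustration, not the article's actual code: trl argument names vary across versions, and the FAQ pair shown is illustrative.

```python
# Minimal SFT sketch with Hugging Face trl (argument names vary by version).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # gated model; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative FAQ pair formatted into a single instruction-style text field.
faq_pairs = [{"question": "How do I check my balance?", "answer": "Dial *144#."}]
train_dataset = Dataset.from_list([
    {"text": f"### Question:\n{p['question']}\n\n### Answer:\n{p['answer']}"}
    for p in faq_pairs
])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    args=TrainingArguments(output_dir="llama2-faq-sft", num_train_epochs=3),
)
trainer.train()
```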

Link to Article (with all code)

This is part one of a two-part series in which I build a scraper to collect most of the FAQs about Safaricom products, to be used later for fine-tuning an open-source Llama 2 Large Language Model and, eventually, for developing a chatbot through which users can interact with the fine-tuned model.

I used the following Python packages:

  1. BeautifulSoup: to parse HTML and XML documents, making it easier to extract information from web pages.
  2. Selenium: to automate interactions with the website. It’s particularly useful for scraping dynamic content and interacting with JavaScript-driven pages.
  3. Pandas: to manipulate and store the data.
  4. Random: to add random delays between requests to avoid overloading the server.

I was able to scrape and store 1,759 non-null product-related FAQs and their answers from https://www.safaricom.co.ke/media-center-landing/frequently-asked-questions. A sketch of the scraping flow follows.
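This is a minimal sketch of the scraping flow, assuming simple CSS class names; the real page structure differs, and the article covers the full scraper.

```python
# Minimal scraping sketch: the CSS selectors below are hypothetical.
import random
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.safaricom.co.ke/media-center-landing/frequently-asked-questions")
time.sleep(random.uniform(2, 5))  # random delay to avoid overloading the server

soup = BeautifulSoup(driver.page_source, "html.parser")
rows = []
for item in soup.select(".faq-item"):  # hypothetical class name
    question = item.select_one(".faq-question")
    answer = item.select_one(".faq-answer")
    if question and answer:
        rows.append({
            "question": question.get_text(strip=True),
            "answer": answer.get_text(strip=True),
        })

driver.quit()
pd.DataFrame(rows).dropna().to_csv("safaricom_faqs.csv", index=False)
```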

Link to Article (with all code)

The Hansard and Audio Services Directorate within the Kenyan Parliament is responsible for recording and producing verbatim reports of parliamentary proceedings and committee deliberations. Curious to understand the topics discussed by members of parliament over time, I sought to explore the Hansard reports for specific sessions or sittings. However, due to the length of these reports and the challenge of identifying relevant dates, this endeavor proved time-consuming and potentially unproductive.

This led me to explore effective methods of querying PDF documents and obtaining insightful information on specific topics. After considering various options, I decided to leverage Large Language Models (LLMs), with OpenAI being my preferred choice.

The analysis follows these steps:

  1. Sourcing data: extracting the PDFs from the official website, as they are publicly available.
  2. Checking the validity of the PDFs.
  3. Setting up dependencies: configuring the necessary software libraries and tools.
  4. Querying the Hansard reports for 2018: there is no particular significance to the 2018 reports; I chose this subset for demonstration purposes.
  5. Summarizing the PDFs (see the sketch after this list).
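A minimal sketch of the querying step, assuming LangChain with OpenAI; module paths and class names vary across LangChain versions, and the filename is hypothetical.

```python
# Minimal PDF question-answering sketch (classic LangChain 0.0.x paths).
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Load one Hansard report and split it into overlapping chunks for retrieval.
pages = PyPDFLoader("hansard_2018_sitting.pdf").load()  # hypothetical filename
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(pages)

# Embed the chunks, index them, and answer questions over the index.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=index.as_retriever())
print(qa.run("What topics were debated in this sitting?"))
```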

Link to Notebook
Link to Article

Power outages are a prevalent challenge encountered by utility companies, highlighting the need for a thorough analysis of historical data to understand patterns and trends. I analysed the historical outage data for Ausgrid, Australia’s largest electricity distributor, which services 1.7 million customers across Sydney, the Hunter Valley, and the Central Coast.

In summary:

  1. The analysis shows that equipment faults have consistently been a major cause of power outages in the Ausgrid network.
  2. The year 2020 recorded the highest number of outages during the period covered by the dataset.
  3. Based on the data, Gosford, Hornsby, and Wyong were the locations most affected by power outages. Outages in Gosford were mostly caused by environmental factors.
  4. Power outages were more prevalent in the afternoons, peaking at around 6 PM across all days of the week (see the aggregation sketch after this list). As such, Ausgrid's rostering for the afternoon shift should consider having more workers on standby. Mornings and late evenings are usually quieter periods for power outages.
  5. The analysis indicates that power outages are more prevalent on Saturdays, so there is a need for better workforce planning on this day.
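A sketch of the kind of hour/day aggregation behind findings 4 and 5; the file and column names are assumptions, as the actual Ausgrid export differs.

```python
# Aggregate outage counts by hour of day and day of week (hypothetical columns).
import pandas as pd

outages = pd.read_csv("ausgrid_outages.csv", parse_dates=["start_time"])
outages["hour"] = outages["start_time"].dt.hour
outages["weekday"] = outages["start_time"].dt.day_name()

by_hour = outages.groupby("hour").size()
by_day = outages.groupby("weekday").size()
print(by_hour.idxmax(), by_day.idxmax())  # busiest hour and busiest day
```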

Link to Notebook

This is a prediction problem based on a time-series dataset of online sales from a UK-based store. The company sells unique all-occasion giftware, and wholesalers make up a high proportion of its customers. The sales data runs from 01/12/2009 to 09/12/2011. The problem is to predict sales for the next 22 days from this historical data, as the owner wants to know the expected revenue in time to be sure of the sports car he buys his partner for Christmas.

Dataset: The dataset has 1,067,371 sales records. Each record is identified by 8 attributes: Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, and Country. Descriptions of the individual attributes are found at https://www.kaggle.com/mashlyn/online-retail-ii-uci#

The dataset file, online_retail_II.csv, can be found at https://www.kaggle.com/mashlyn/online-retail-ii-uci. I could not upload it here directly due to the 25 MB size limitation.

What the Notebook Covers:

  1. Ingesting the dataset.
  2. Exploratory Data Analysis (EDA), covering:
     a) total daily, weekly, and monthly sales volumes;
     b) last months' revenue share by product and by customer;
     c) weighted average monthly sale price by volume.
  3. Data cleaning and encoding.
  4. Data modelling (using Facebook's Prophet) for time series-based revenue prediction (sketched below).
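A minimal sketch of the Prophet forecasting step; the preprocessing here is simplified relative to the notebook.

```python
# Minimal Prophet sketch for the 22-day revenue forecast.
import pandas as pd
from prophet import Prophet  # older releases: `from fbprophet import Prophet`

sales = pd.read_csv("online_retail_II.csv", parse_dates=["InvoiceDate"])
sales["Revenue"] = sales["Quantity"] * sales["Price"]

# Prophet expects two columns: ds (date) and y (value to forecast).
daily = sales.groupby(sales["InvoiceDate"].dt.date)["Revenue"].sum().reset_index()
daily.columns = ["ds", "y"]

model = Prophet()
model.fit(daily)
future = model.make_future_dataframe(periods=22)  # the next 22 days
forecast = model.predict(future)
print(forecast[["ds", "yhat"]].tail(22))
```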

Sales Data Timeseries Modelling

Link to Notebook

EDA and analytics on a historical dataset of the modern Olympic Games, covering all the Games from Athens 1896 to Tokyo 2020. The data was scraped from www.sports-reference.com.

Objective: to visualize how the Olympics have evolved over time, with special emphasis on African countries, which began participating many years after the competitions began. This is achieved by merging and visualizing output from the above datasets (see the sketch below).
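A sketch of the kind of merge involved, assuming the Kaggle "120 years of Olympic history" files (athlete_events.csv and noc_regions.csv), which mirror the scraped data.

```python
# Merge athlete-event records with country names via the NOC code.
import pandas as pd

athletes = pd.read_csv("athlete_events.csv")
regions = pd.read_csv("noc_regions.csv")
olympics = athletes.merge(regions, on="NOC", how="left")

# Example view: medals won by a few African countries per Games year.
african = olympics[olympics["region"].isin(["Kenya", "Ethiopia", "Nigeria"])]
print(african.dropna(subset=["Medal"]).groupby(["Year", "region"]).size())
```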

Olympics Data Analytics

Link

Social network mining for users within Kenya's Deputy President's Twitter account. This network setup has three significant weaknesses:

  • Isolated users: the network contains many isolates, especially around @WilliamSRuto's cluster. These users are likely to miss what, for example, @MbuiMumbi or @oleitumbi disseminates unless it is re-shared by @WilliamSRuto, which may not always be the case. This is depicted by the low Reciprocated Vertex Pair Ratio.
  • Weak inter- and intra-cluster edges: connections within clusters are weak, except for G1 to G5, so content in a cluster is less likely to reach all users in it. The situation is even worse for inter-cluster connections.
  • Influence isolation: @oleitumbi is the only influential user in this collection period, which makes the account a prime target for suspension, e.g. if someone reports any policy violations. This is depicted by the low graph density value. The two metrics are sketched below.
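A sketch of the two metrics using networkx on toy edges; the original analysis may have used a different tool, such as NodeXL.

```python
# Graph density and reciprocity on a toy directed follower graph.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("@WilliamSRuto", "@oleitumbi"),
    ("@oleitumbi", "@WilliamSRuto"),
    ("@MbuiMumbi", "@WilliamSRuto"),  # toy edges for illustration only
])

print("Density:", nx.density(G))          # low values indicate a sparse network
print("Reciprocity:", nx.reciprocity(G))  # akin to the reciprocated vertex pair ratio
```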

DP Ruto's Twitter Social Network Engine

Repository

A tool designed and developed with Python and Streamlit to help you upload files to an online SharePoint location. It works with SharePoint 365 but can be modified to fit earlier SharePoint versions. Current functionality includes:

  • Specifying the folder path to the files to be uploaded (source URL).
  • Summary information about the files to be uploaded.
  • Specification of SharePoint login and related upload details.
  • Creation of a folder, named after today's date, in the user-specified base URL.
  • Upload of the files matching the specified extension (currently .xlsx) to the new folder in the base URL; the file format can be changed (see the sketch after this list).
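A minimal upload sketch, assuming the Office365-REST-Python-Client package; the app's actual code may differ, and method signatures vary across package versions.

```python
# Create a dated folder in SharePoint and upload one .xlsx file into it.
from datetime import date

from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

site_url = "https://yourtenant.sharepoint.com/sites/YourSite"  # hypothetical
ctx = ClientContext(site_url).with_credentials(
    UserCredential("user@yourdomain.com", "password")
)

# Folder named after today's date, under the base document library.
folder_name = date.today().strftime("%Y-%m-%d")
target = ctx.web.folders.add(f"Shared Documents/{folder_name}").execute_query()

with open("report.xlsx", "rb") as f:
    target.upload_file("report.xlsx", f.read()).execute_query()
```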

Sharepoint Uploader

Notes on Usage

  • A deployed version of the app can be found at https://sharepointuploader.herokuapp.com/. The app can also be cloned and run locally with Streamlit: streamlit run SharepointUploader.py. When doing this, ensure you have the required modules listed in the requirements file.
  • Make sure the account details for accessing SharePoint on your domain are valid. Normally, these are your domain-specific email address and its password.

Bugs, Enhancements and Comments

All comments, bug reports, and enhancement requests are welcome. Please submit a new issue, and I will work on improving the app.

Future Functionality

Future functionality will likely include:

  • Option to specify file formats to be uploaded in a folder with mixed file types.
  • An email notification to the user once all the files are uploaded.

Repository | Notebook | nbviewer

  • Descriptive and predictive analytics for a synthetic dataset on financial crimes.
  • The dataset (https://www.kaggle.com/ntnu-testimon/paysim1/download) is synthetic, i.e. simulated using PaySim from a sample of real transactions extracted from one month of financial logs of a mobile money service implemented in an African country. It is scaled down to 1/4 of the original dataset.
  • Used the sweetviz package for Exploratory Data Analysis (EDA).
  • Identified the most probable fraud indicators.
  • Trained XGBoost and Random Forest classifiers, with area under the precision-recall curve (AUPRC) as the evaluation metric for the skewed dataset (sketched below).
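A sketch of the XGBoost/AUPRC setup; the file name is a placeholder, and the column names follow the PaySim export.

```python
# Train XGBoost on the imbalanced PaySim data and score it with AUPRC.
import pandas as pd
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("paysim.csv")  # placeholder local filename
X = pd.get_dummies(
    df.drop(columns=["isFraud", "isFlaggedFraud", "nameOrig", "nameDest"]),
    columns=["type"],
)
y = df["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# scale_pos_weight counters the heavy class imbalance.
model = XGBClassifier(
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum()
)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print("AUPRC:", average_precision_score(y_test, probs))
```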

Correlation Plot for Different Factors in Financial Crime

Conclusion:

  1. Fraud detection is a difficult process, compounded by the scarcity of reliable data in the area.
  2. Tree-based algorithms worked better at detecting fraud, partly owing to the nature of the data.

In this project, I set up a framework to collect tweets from five Kenyan politicians, analyzed the tweet sentiments/emotions over time, packaged the analysis in a Streamlit app, and hosted it on Heroku.
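A minimal sketch of a Streamlit sentiment app, using TextBlob for brevity; the deployed app's code and sentiment model may differ.

```python
# Tiny Streamlit app that scores a pasted tweet's sentiment polarity.
import streamlit as st
from textblob import TextBlob

st.title("Political Sentiment Analyzer")
tweet = st.text_area("Paste a tweet to analyze")

if tweet:
    polarity = TextBlob(tweet).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    st.write(f"Sentiment: {label} (polarity {polarity:.2f})")
```

Run locally with streamlit run app.py.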

Code | Deployed App

Repository | Notebook | nbviewer | Blog Article

  • Analytics of my body, activity, and sleep data during the COVID-19 lockdown.
  • Identification of the important factors behind my weight loss during the lockdown.

Fitbit Data Analytics

Repository | Notebook | nbviewer

  • Collection of streaming tweets from Auckland and Wellington, New Zealand's largest cities, during the COVID-19 lockdown period.
  • Descriptive analytics of the tweeting patterns of users from the two cities. Aucklanders seemed to work more than tweet.
  • Topics of discussion were semantically identical across the cities, visualized with pyLDAvis (sketched below).
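A sketch of the topic modelling and visualization, assuming gensim LDA with pyLDAvis; the preprocessing is reduced to toy tokens here.

```python
# Fit a small LDA model and export the interactive pyLDAvis panel.
import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis < 3.x used `pyLDAvis.gensim`
from gensim import corpora
from gensim.models import LdaModel

docs = [["lockdown", "work", "home"], ["covid", "cases", "auckland"]]  # toy tokens
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)
panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(panel, "topics.html")
```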

Tweeting Patterns during COVID-19 Lockdown

Repository | Notebook | nbviewer

The objective of the competition was to create a machine learning model to help the Kenyan non-profit organization Local Ocean Conservation anticipate the number of turtles it will rescue from each of its rescue sites as part of its By-Catch Release Programme (https://zindi.africa/competitions/sea-turtle-rescue-forecast-challenge).

  • Descriptive analytics and EDA for the dataset, including encoding and other preprocessing for better modelling.
  • Achieved an RMSE of 1.1897261428182493, the competition's evaluation metric.

Turtles Capture and Release Programme

Repository | Notebook | nbviewer | Blog Article

  • Collection of tweets from the different cities via geocoding.
  • Translation via the GoogleTranslate Python library for modelling.
  • Descriptive analytics of the datasets per country.
  • Modelling via BERT, with batch sentiment prediction per tweet, grouped by city (sketched below).
  • The sentiment with the highest predicted probability was taken as the true sentiment of the tweet.
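A sketch of the batch sentiment step, assuming a Hugging Face pipeline; the model choice, file, and column names are assumptions.

```python
# Batch-score tweets with a BERT sentiment pipeline, then group by city.
import pandas as pd
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

tweets = pd.read_csv("tweets_by_city.csv")  # hypothetical: text, city columns
preds = classifier(tweets["text"].tolist(), truncation=True, batch_size=32)

# The pipeline returns the top-probability label for each tweet.
tweets["sentiment"] = [p["label"] for p in preds]
print(tweets.groupby(["city", "sentiment"]).size())
```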

BERT Happiness Index

Article

Code snippets with:

  • Descriptive analytics of TripAdvisor reviews for the Museum of New Zealand (Te Papa Tongarewa).
  • Sentiment analysis of the reviews.

Sentiment Over Time

Article

The project was an analytical piece on what Kenyans really discuss online. The data, in the form of tweets, covers January to December 2019.

Questions of Interest:

  1. Are we able to deduce the nature of Kenyans based on their daily chatter? Do they talk about substantive issues?
  2. Are they topically consistent in their talk over time?

Nairobi City from the Space Station

R

Repository | Code

  • Descriptive analytics of tweets geolocated to New Zealand.
  • Emotion analysis of the collected tweets.

Emotions Distribution

Repository | Code | Blog Article

  • Descriptive analytics of Uhuru Kenyatta's State of the Nation speeches from 2014 to 2019.
  • The language in Uhuru Kenyatta's State of the Nation addresses never changed over 2014-2019.
  • His 2015 speech was the most difficult in terms of readability, i.e. it required a postgraduate/advanced undergraduate reading level to read and probably understand.
  • His 2019 speech was the most positive of all his speeches.

Polarity in the speeches over time

SAS

Repository | Code | Dataset | Analytics Output

Visualizations (PowerBI, Tableau, ScatterText)

Code

  • COVID-19 Numbers by 10/03/2020 in PowerBI.

COVID-19 Numbers as at 10/03/2020

Repository

Used the Scattertext package in Python to interactively visualize reviews of the Museum of New Zealand, as sketched below.
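A minimal Scattertext sketch; the category and column names are assumptions, not the project's actual code.

```python
# Build a Scattertext corpus from reviews and export the interactive HTML.
import pandas as pd
import scattertext as st
import spacy

nlp = spacy.load("en_core_web_sm")
reviews = pd.read_csv("te_papa_reviews.csv")  # hypothetical columns below

corpus = st.CorpusFromPandas(
    reviews, category_col="rating_class", text_col="review_text", nlp=nlp
).build()

# Interactive plot comparing term usage between positive and negative reviews.
html = st.produce_scattertext_explorer(
    corpus,
    category="positive",
    category_name="Positive",
    not_category_name="Negative",
)
with open("terms_visualization.html", "w") as f:
    f.write(html)
```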

Visualization of Terms Code | nbviewer

Visualization of Topics Code (https://lnkd.in/gzvNBwF) | nbviewer

Terms visualizations in ScatterText

Model Deployment

Code | Deployed App

Political Sentiment Analyzer

Data Science Foundations
