
Data Science, Engineering and Analytics Projects and Solutions

This portfolio consists of several data science and analytics projects, concepts, tools, and resources illustrating the work I have done to further develop my data science skills.


Table of Contents

1. Python

2. R

3. SAS

4. Visualizations (PowerBI, Tableau, ScatterText)

5. Model Deployment

Projects

Python

Transforming Customer Experiences: How a Fine-Tuned Llama 2 Model Can Empower Product FAQs

Link to Article (with all code)

I detail how I fine-tuned an open-source Llama 2 model on Safaricom's product- and service-related FAQ question-and-answer pairs. Models such as Llama 2 can predict the next token in a sequence, but this predictive ability alone does not make them highly effective virtual assistants, as they do not inherently respond to explicit instructions. To bridge this gap, a technique known as instruction tuning is applied to align their responses more closely with human expectations. I used Supervised Fine-Tuning (SFT) with the FAQ pairs: the model is trained on a dataset of paired instructions and corresponding responses, as in our case with the question-answer pairs. The goal is to optimize the model's internal parameters to minimize the disparity between the generated answers and the ground-truth responses, which serve as reference labels.
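Below is a minimal sketch of this SFT setup using the Hugging Face trl library. This is an assumption for illustration, not the article's actual code: trl argument names vary across versions, and the FAQ pair shown is illustrative.

```python
# Minimal SFT sketch with Hugging Face trl (argument names vary by version).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # gated model; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative FAQ pair formatted into a single instruction-style text field.
faq_pairs = [{"question": "How do I check my balance?", "answer": "Dial *144#."}]
train_dataset = Dataset.from_list([
    {"text": f"### Question:\n{p['question']}\n\n### Answer:\n{p['answer']}"}
    for p in faq_pairs
])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    args=TrainingArguments(output_dir="llama2-faq-sft", num_train_epochs=3),
)
trainer.train()
```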

Link to Article (with all code)

This is part one of a two-part series in which I build a scraper to collect most of the FAQs about Safaricom products, to be used later for fine-tuning an open-source Llama 2 Large Language Model and, eventually, for developing a chatbot through which users can interact with the fine-tuned model.

I used the following Python packages:

  1. BeautifulSoup: to parse HTML and XML documents, making it easier to extract information from web pages.
  2. Selenium: to automate interactions with the website. It’s particularly useful for scraping dynamic content and interacting with JavaScript-driven pages.
  3. Pandas: to manipulate and store the data.
  4. Random: to add random delays between requests to avoid overloading the server.

I was able to scrape and store 1,759 non-null product-related FAQs and their answers from https://www.safaricom.co.ke/media-center-landing/frequently-asked-questions. A sketch of the scraping flow follows.
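This is a minimal sketch of the scraping flow, assuming simple CSS class names; the real page structure differs, and the article covers the full scraper.

```python
# Minimal scraping sketch: the CSS selectors below are hypothetical.
import random
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.safaricom.co.ke/media-center-landing/frequently-asked-questions")
time.sleep(random.uniform(2, 5))  # random delay to avoid overloading the server

soup = BeautifulSoup(driver.page_source, "html.parser")
rows = []
for item in soup.select(".faq-item"):  # hypothetical class name
    question = item.select_one(".faq-question")
    answer = item.select_one(".faq-answer")
    if question and answer:
        rows.append({
            "question": question.get_text(strip=True),
            "answer": answer.get_text(strip=True),
        })

driver.quit()
pd.DataFrame(rows).dropna().to_csv("safaricom_faqs.csv", index=False)
```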

Link to Article (with all code)

The Hansard and Audio Services Directorate within the Kenyan Parliament is responsible for recording and producing verbatim reports of parliamentary proceedings and committee deliberations. Curious to understand the topics discussed by members of parliament over time, I sought to explore the Hansard reports for specific sessions or sittings. However, due to the length of these reports and the challenge of identifying relevant dates, this endeavor proved time-consuming and potentially unproductive.

This led me to explore effective methods of querying PDF documents and obtaining insightful information on specific topics. After considering various options, I decided to leverage Large Language Models (LLMs), with OpenAI being my preferred choice.

The analysis follows these steps:

  1. Sourcing data: extracting the PDFs from the official website, as they are publicly available.
  2. Checking the validity of the PDFs.
  3. Setting up dependencies: configuring the necessary software libraries and tools.
  4. Querying the Hansard reports for 2018: there is no particular significance to the 2018 reports; I chose this subset for demonstration purposes.
  5. Summarizing the PDFs (see the sketch after this list).
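A minimal sketch of the querying step, assuming LangChain with OpenAI; module paths and class names vary across LangChain versions, and the filename is hypothetical.

```python
# Minimal PDF question-answering sketch (classic LangChain 0.0.x paths).
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Load one Hansard report and split it into overlapping chunks for retrieval.
pages = PyPDFLoader("hansard_2018_sitting.pdf").load()  # hypothetical filename
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(pages)

# Embed the chunks, index them, and answer questions over the index.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=index.as_retriever())
print(qa.run("What topics were debated in this sitting?"))
```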

Link to Notebook
Link to Article

Power outages are a prevalent challenge encountered by utility companies, highlighting the need for a thorough analysis of historical data to understand patterns and trends. I analysed the historical outage data for Ausgrid, Australia’s largest electricity distributor, which services 1.7 million customers across Sydney, the Hunter Valley, and the Central Coast.

In summary:

  1. The analysis shows that equipment faults have consistently been a major cause of power outages in the Ausgrid network.
  2. The year 2020 recorded the highest number of outages during the period covered by the dataset.
  3. Based on the data, Gosford, Hornsby, and Wyong were the locations most affected by power outages. Outages in Gosford were mostly caused by environmental factors.
  4. Power outages were more prevalent in the afternoons, peaking at around 6 PM across all days of the week (see the aggregation sketch after this list). As such, Ausgrid's rostering for the afternoon shift should consider having more workers on standby. Mornings and late evenings are usually quieter periods for power outages.
  5. The analysis indicates that power outages are more prevalent on Saturdays, so there is a need for better workforce planning on this day.
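A sketch of the kind of hour/day aggregation behind findings 4 and 5; the file and column names are assumptions, as the actual Ausgrid export differs.

```python
# Aggregate outage counts by hour of day and day of week (hypothetical columns).
import pandas as pd

outages = pd.read_csv("ausgrid_outages.csv", parse_dates=["start_time"])
outages["hour"] = outages["start_time"].dt.hour
outages["weekday"] = outages["start_time"].dt.day_name()

by_hour = outages.groupby("hour").size()
by_day = outages.groupby("weekday").size()
print(by_hour.idxmax(), by_day.idxmax())  # busiest hour and busiest day
```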

Link to Notebook

This is a prediction problem based on a time-series dataset of online sales from a UK-based store. The company sells unique all-occasion giftware, and wholesalers make up a high proportion of its customers. The sales data runs from 01/12/2009 to 09/12/2011. The problem is to predict sales for the next 22 days from this historical data, as the owner wants to know the expected revenue in time to be sure of the sports car he buys his partner for Christmas.

Dataset: The dataset has 1,067,371 sales records. Each record is identified by 8 attributes: Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, and Country. Descriptions of the individual attributes are found at https://www.kaggle.com/mashlyn/online-retail-ii-uci#

The dataset file, online_retail_II.csv, can be found at https://www.kaggle.com/mashlyn/online-retail-ii-uci. I could not upload it here directly due to the 25 MB size limitation.

What the Notebook Covers:

  1. Ingesting the dataset.
  2. Exploratory Data Analysis (EDA), covering:
     a) total daily, weekly, and monthly sales volumes;
     b) last months' revenue share by product and by customer;
     c) weighted average monthly sale price by volume.
  3. Data cleaning and encoding.
  4. Data modelling (using Facebook's Prophet) for time series-based revenue prediction (sketched below).
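A minimal sketch of the Prophet forecasting step; the preprocessing here is simplified relative to the notebook.

```python
# Minimal Prophet sketch for the 22-day revenue forecast.
import pandas as pd
from prophet import Prophet  # older releases: `from fbprophet import Prophet`

sales = pd.read_csv("online_retail_II.csv", parse_dates=["InvoiceDate"])
sales["Revenue"] = sales["Quantity"] * sales["Price"]

# Prophet expects two columns: ds (date) and y (value to forecast).
daily = sales.groupby(sales["InvoiceDate"].dt.date)["Revenue"].sum().reset_index()
daily.columns = ["ds", "y"]

model = Prophet()
model.fit(daily)
future = model.make_future_dataframe(periods=22)  # the next 22 days
forecast = model.predict(future)
print(forecast[["ds", "yhat"]].tail(22))
```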

Sales Data Timeseries Modelling

Link to Notebook

EDA and analytics on a historical dataset of the modern Olympic Games, covering all the Games from Athens 1896 to Tokyo 2020. The data was scraped from www.sports-reference.com.

Objective: to visualize how the Olympics have evolved over time, with special emphasis on African countries, which began participating many years after the competitions began. This is achieved by merging and visualizing output from the above datasets (see the sketch below).
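A sketch of the kind of merge involved, assuming the Kaggle "120 years of Olympic history" files (athlete_events.csv and noc_regions.csv), which mirror the scraped data.

```python
# Merge athlete-event records with country names via the NOC code.
import pandas as pd

athletes = pd.read_csv("athlete_events.csv")
regions = pd.read_csv("noc_regions.csv")
olympics = athletes.merge(regions, on="NOC", how="left")

# Example view: medals won by a few African countries per Games year.
african = olympics[olympics["region"].isin(["Kenya", "Ethiopia", "Nigeria"])]
print(african.dropna(subset=["Medal"]).groupby(["Year", "region"]).size())
```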

Olympics Data Analytics

Link

Social network mining for users within Kenya's Deputy President's Twitter account. This network setup has three significant weaknesses:

  • Isolated users: the network contains many isolates, especially around @WilliamSRuto's cluster. These users are likely to miss what, for example, @MbuiMumbi or @oleitumbi disseminates unless it is re-shared by @WilliamSRuto, which may not always be the case. This is depicted by the low Reciprocated Vertex Pair Ratio.
  • Weak inter- and intra-cluster edges: connections within clusters are weak, except for G1 to G5, so content in a cluster is less likely to reach all users in it. The situation is even worse for inter-cluster connections.
  • Influence isolation: @oleitumbi is the only influential user in this collection period, which makes the account a prime target for suspension, e.g. if someone reports any policy violations. This is depicted by the low graph density value. The two metrics are sketched below.
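A sketch of the two metrics using networkx on toy edges; the original analysis may have used a different tool, such as NodeXL.

```python
# Graph density and reciprocity on a toy directed follower graph.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("@WilliamSRuto", "@oleitumbi"),
    ("@oleitumbi", "@WilliamSRuto"),
    ("@MbuiMumbi", "@WilliamSRuto"),  # toy edges for illustration only
])

print("Density:", nx.density(G))          # low values indicate a sparse network
print("Reciprocity:", nx.reciprocity(G))  # akin to the reciprocated vertex pair ratio
```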

DP Ruto's Twitter Social Network Engine

Repository

A tool designed and developed with Python and Streamlit to help you upload files to an online SharePoint location. It works with SharePoint 365 but can be modified to fit earlier SharePoint versions. Current functionality includes:

  • Specifying the folder path to the files to be uploaded (source URL).
  • Summary information about the files to be uploaded.
  • Specification of SharePoint login and related upload details.
  • Creation of a folder, named after today's date, in the user-specified base URL.
  • Upload of the files matching the specified extension (currently .xlsx) to the new folder in the base URL; the file format can be changed (see the sketch after this list).
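A minimal upload sketch, assuming the Office365-REST-Python-Client package; the app's actual code may differ, and method signatures vary across package versions.

```python
# Create a dated folder in SharePoint and upload one .xlsx file into it.
from datetime import date

from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

site_url = "https://yourtenant.sharepoint.com/sites/YourSite"  # hypothetical
ctx = ClientContext(site_url).with_credentials(
    UserCredential("user@yourdomain.com", "password")
)

# Folder named after today's date, under the base document library.
folder_name = date.today().strftime("%Y-%m-%d")
target = ctx.web.folders.add(f"Shared Documents/{folder_name}").execute_query()

with open("report.xlsx", "rb") as f:
    target.upload_file("report.xlsx", f.read()).execute_query()
```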

Sharepoint Uploader

Notes on Usage

  • A deployed version of the app can be found at https://sharepointuploader.herokuapp.com/. The app can also be cloned and run locally with Streamlit: streamlit run SharepointUploader.py. When doing this, ensure you have the required modules listed in the requirements file.
  • Make sure the account details for accessing SharePoint on your domain are valid. Normally, these are your domain-specific email address and its password.

Bugs, Enhancements and Comments

All comments, bug reports, and enhancement requests are welcome. Please submit a new issue, and I will work on improving the app.

Future Functionality

Future functionality will likely include:

  • Option to specify file formats to be uploaded in a folder with mixed file types.
  • An email notification to the user once all the files are uploaded.

Repository | Notebook | nbviewer

  • Descriptive and predictive analytics for a synthetic dataset on financial crimes.
  • The dataset (https://www.kaggle.com/ntnu-testimon/paysim1/download) is synthetic, i.e. simulated using PaySim from a sample of real transactions extracted from one month of financial logs of a mobile money service implemented in an African country. It is scaled down to 1/4 of the original dataset.
  • Used the sweetviz package for Exploratory Data Analysis (EDA).
  • Identified the most probable fraud indicators.
  • Trained XGBoost and Random Forest classifiers, with area under the precision-recall curve (AUPRC) as the evaluation metric for the skewed dataset (sketched below).
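A sketch of the XGBoost/AUPRC setup; the file name is a placeholder, and the column names follow the PaySim export.

```python
# Train XGBoost on the imbalanced PaySim data and score it with AUPRC.
import pandas as pd
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("paysim.csv")  # placeholder local filename
X = pd.get_dummies(
    df.drop(columns=["isFraud", "isFlaggedFraud", "nameOrig", "nameDest"]),
    columns=["type"],
)
y = df["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# scale_pos_weight counters the heavy class imbalance.
model = XGBClassifier(
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum()
)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print("AUPRC:", average_precision_score(y_test, probs))
```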

Correlation Plot for Different Factors in Financial Crime

Conclusion:

  1. Fraud detection is a difficult process, compounded by the scarcity of reliable data in the area.
  2. Tree-based algorithms worked better at detecting fraud, partly owing to the nature of the data.

In this project, I set up a framework to collect tweets from five Kenyan politicians, analyzed the tweet sentiments/emotions over time, packaged the analysis in a Streamlit app, and hosted it on Heroku.
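A minimal sketch of a Streamlit sentiment app, using TextBlob for brevity; the deployed app's code and sentiment model may differ.

```python
# Tiny Streamlit app that scores a pasted tweet's sentiment polarity.
import streamlit as st
from textblob import TextBlob

st.title("Political Sentiment Analyzer")
tweet = st.text_area("Paste a tweet to analyze")

if tweet:
    polarity = TextBlob(tweet).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    st.write(f"Sentiment: {label} (polarity {polarity:.2f})")
```

Run locally with streamlit run app.py.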

Code | Deployed App

Repository | Notebook | nbviewer | Blog Article

  • Analytics of my body, activity, and sleep data during the COVID-19 lockdown.
  • Identification of the important factors behind my weight loss during the lockdown.

Fitbit Data Analytics

Repository | Notebook | nbviewer

  • Collection of streaming tweets from Auckland and Wellington, New Zealand's largest cities, during the COVID-19 lockdown period.
  • Descriptive analytics of the tweeting patterns of users from the two cities. Aucklanders seemed to work more than tweet.
  • Topics of discussion were semantically identical across the cities, visualized with pyLDAvis (sketched below).
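A sketch of the topic modelling and visualization, assuming gensim LDA with pyLDAvis; the preprocessing is reduced to toy tokens here.

```python
# Fit a small LDA model and export the interactive pyLDAvis panel.
import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis < 3.x used `pyLDAvis.gensim`
from gensim import corpora
from gensim.models import LdaModel

docs = [["lockdown", "work", "home"], ["covid", "cases", "auckland"]]  # toy tokens
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)
panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(panel, "topics.html")
```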

Tweeting Patterns during COVID-19 Lockdown

Repository | Notebook | nbviewer

The objective of the competition was to create a machine learning model to help the Kenyan non-profit organization Local Ocean Conservation anticipate the number of turtles it will rescue from each of its rescue sites as part of its By-Catch Release Programme (https://zindi.africa/competitions/sea-turtle-rescue-forecast-challenge).

  • Descriptive analytics and EDA for the dataset, including encoding and other preprocessing for better modelling.
  • Achieved an RMSE of 1.1897261428182493, the competition's evaluation metric.

Turtles Capture and Release Programme

Repository | Notebook | nbviewer | Blog Article

  • Collection of tweets from the different cities via geocoding.
  • Translation via the GoogleTranslate Python library for modelling.
  • Descriptive analytics of the datasets per country.
  • Modelling via BERT, with batch sentiment prediction per tweet, grouped by city (sketched below).
  • The sentiment with the highest predicted probability was taken as the true sentiment of the tweet.
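A sketch of the batch sentiment step, assuming a Hugging Face pipeline; the model choice, file, and column names are assumptions.

```python
# Batch-score tweets with a BERT sentiment pipeline, then group by city.
import pandas as pd
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

tweets = pd.read_csv("tweets_by_city.csv")  # hypothetical: text, city columns
preds = classifier(tweets["text"].tolist(), truncation=True, batch_size=32)

# The pipeline returns the top-probability label for each tweet.
tweets["sentiment"] = [p["label"] for p in preds]
print(tweets.groupby(["city", "sentiment"]).size())
```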

BERT Happiness Index

Article

Code snippets with:

  • Descriptive analytics of TripAdvisor reviews for the Museum of New Zealand (Te Papa Tongarewa).
  • Sentiment analysis of the reviews.

Sentiment Over Time

Article

The project was an analytical piece on what Kenyans really discuss online. The data, in the form of tweets, covers January to December 2019.

Questions of Interest:

  1. Are we able to deduce the nature of Kenyans based on their daily chatter? Do they talk about substantive issues?
  2. Are they topically consistent in their talk over time?

Nairobi City from the Space Station

R

Repository | Code

  • Descriptive analytics of tweets geolocated to New Zealand.
  • Emotion analysis of the collected tweets.

Emotions Distribution

Repository | Code | Blog Article

  • Descriptive analytics of Uhuru Kenyatta's State of the Nation speeches from 2014 to 2019.
  • The language in Uhuru Kenyatta's State of the Nation addresses never changed over 2014-2019.
  • His 2015 speech was the most difficult in terms of readability, i.e. it required a postgraduate/advanced undergraduate reading level to read and probably understand.
  • His 2019 speech was the most positive of all his speeches.

Polarity in the speeches over time

SAS

Repository | Code | Dataset | Analytics Output

Visualizations (PowerBI, Tableau, ScatterText)

Code

  • COVID-19 Numbers by 10/03/2020 in PowerBI.

COVID-19 Numbers as at 10/03/2020

Repository

Used the Scattertext package in Python to interactively visualize reviews of the Museum of New Zealand, as sketched below.
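A minimal Scattertext sketch; the category and column names are assumptions, not the project's actual code.

```python
# Build a Scattertext corpus from reviews and export the interactive HTML.
import pandas as pd
import scattertext as st
import spacy

nlp = spacy.load("en_core_web_sm")
reviews = pd.read_csv("te_papa_reviews.csv")  # hypothetical columns below

corpus = st.CorpusFromPandas(
    reviews, category_col="rating_class", text_col="review_text", nlp=nlp
).build()

# Interactive plot comparing term usage between positive and negative reviews.
html = st.produce_scattertext_explorer(
    corpus,
    category="positive",
    category_name="Positive",
    not_category_name="Negative",
)
with open("terms_visualization.html", "w") as f:
    f.write(html)
```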

Visualization of Terms Code | nbviewer

Visualization of Topics Code (https://lnkd.in/gzvNBwF) | nbviewer

Terms visualizations in ScatterText

Model Deployment

Code | Deployed App

Political Sentiment Analyzer

Data Science Foundations
