<a href="https://colab.research.google.com/github/vee-22/victoriaWeb3/blob/main/DataMining_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## 1.1 Problem Statement
Wikipedia content about *Zambia* remains significantly underrepresented, with very few comprehensive pages dedicated to Zambian topics, history, culture, and current affairs. The *DataLab Research group* at The University of Zambia has identified that understanding the knowledge levels and expertise of current Wikipedia contributors working on Zambian content is crucial for developing targeted strategies to improve content quality and quantity.

Currently, there is no systematic way to assess the expertise and knowledge depth of contributors to Zambian Wikipedia pages, making it difficult to:

- Identify subject matter experts who could mentor new contributors  
- Match contributors with appropriate content areas based on their expertise  
- Develop targeted training programs for contributors with different knowledge levels  
- Optimize content review and quality assurance processes  

*Core Problem:* How can we systematically classify and understand the knowledge levels of Wikipedia contributors working on Zambian content to improve the overall quality and quantity of Wikipedia pages about Zambia?


## 1.2 Business Objectives

### Primary Objectives:
- *Enhance Content Quality:* Improve the accuracy, depth, and comprehensiveness of Wikipedia articles about Zambia by better understanding contributor expertise levels  
- *Optimize Contributor Engagement:* Develop targeted strategies to engage contributors based on their knowledge levels and areas of expertise  
- *Facilitate Knowledge Transfer:* Identify expert contributors who can mentor newcomers and provide guidance on complex topics  
- *Improve Content Coverage:* Strategically assign content creation and improvement tasks based on contributor knowledge levels  



### Secondary Objectives:
- *Build Community Capacity:* Create a framework for assessing and developing Wikipedia contributor skills within the Zambian context  
- *Inform Training Programs:* Develop evidence-based training curricula tailored to different contributor knowledge levels  
- *Support Academic Research:* Provide insights into digital knowledge creation patterns in developing countries  

### Success Metrics:
- *Short-term:* Accurate classification of contributor knowledge levels with measurable confidence scores  
- *Medium-term:* Increased quality scores of Zambian Wikipedia articles (measured by completeness, references, and peer ratings)  
- *Long-term:* Growth in the number of active Zambian Wikipedia contributors and articles about Zambia  





## 1.4 Project Success Criteria

### Model Performance Criteria
- *Primary Metric:* Classification accuracy of at least *80%* on a held-out test set  
- *Precision and Recall:* Minimum *75%* precision and recall for each knowledge level category  
- *F1-Score:* Overall weighted F1-score of at least *0.78*  
- *Cross-validation Stability:* Model performance should not vary by more than *5%* across different data splits  
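
The thresholds above can be checked mechanically once predictions exist. The following is a minimal, dependency-free sketch of such a check; the function names and the toy labels are hypothetical, and in practice a library routine such as scikit-learn's `classification_report` would likely be used instead.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    metrics = {}
    for label in sorted(true_count):
        precision = hits[label] / pred_count[label] if pred_count[label] else 0.0
        recall = hits[label] / true_count[label]
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1, true_count[label])
    return metrics


def check_criteria(y_true, y_pred, min_accuracy=0.80, min_pr=0.75, min_f1=0.78):
    """Compare accuracy, per-class precision/recall and weighted F1 to the targets."""
    n = len(y_true)
    metrics = per_class_metrics(y_true, y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Weighted F1: per-class F1 averaged with class support as the weight.
    weighted_f1 = sum(f1 * support for _, _, f1, support in metrics.values()) / n
    per_class_ok = all(p >= min_pr and r >= min_pr for p, r, _, _ in metrics.values())
    passes = accuracy >= min_accuracy and per_class_ok and weighted_f1 >= min_f1
    return {"accuracy": accuracy, "weighted_f1": weighted_f1, "passes": passes}


# Hypothetical labels, for illustration only (not project results).
y_true = ["Novice"] * 4 + ["Intermediate"] * 3 + ["Expert"] * 3
y_pred = ["Novice"] * 3 + ["Intermediate"] * 4 + ["Expert"] * 3
report = check_criteria(y_true, y_pred)
print(report)
```

Keeping the thresholds as keyword arguments makes it easy to revisit them if the criteria change.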

### Business Impact Criteria
- *Expert Validation:* At least *85%* agreement between model classifications and manual assessments by domain experts  
- *Actionability:* Classification results must provide clear, actionable insights for contributor development strategies  
- *Scalability:* Model should be able to process new contributors and updates efficiently (*< 1 minute* per contributor)  

### Technical Criteria
- *Interpretability:* Model decisions must be explainable with feature importance scores and decision reasoning  
- *Robustness:* Model should maintain performance when applied to contributors from different time periods  
- *Generalizability:* Framework should be adaptable to other developing country contexts or subject areas  

### Data Quality Criteria
- *Coverage:* Successfully classify at least *90%* of active contributors to Zambian Wikipedia pages  
- *Consistency:* Inter-rater reliability (Kappa score) *> 0.7* for manual validation labels  
- *Completeness:* Less than *10%* missing data across critical features  
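
For the Consistency criterion, Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance. The sketch below implements the standard two-rater formula in plain Python; the rater labels are invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical manual labels from two reviewers (N = Novice, I = Intermediate, E = Expert).
rater_a = ["N", "N", "N", "I", "I", "I", "E", "E", "E", "E"]
rater_b = ["N", "N", "I", "I", "I", "E", "E", "E", "E", "E"]
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}, meets > 0.7 target: {kappa > 0.7}")
```

In this toy case the raters agree on 8 of 10 items, yet kappa falls just below 0.7, which shows why the criterion is stated in terms of kappa rather than raw agreement.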

### Deployment Criteria
- *Integration:* Model should be implementable within existing Wikipedia contributor management workflows  
- *User Acceptance:* Positive feedback from at least *80%* of Wikipedia administrators and experienced contributors who test the system  
- *Maintenance:* Established process for model updates and performance monitoring over time

## 1.3 Data Mining Goals

### Primary Data Mining Goal
Build a *multi-class classification model* that can automatically categorize Wikipedia contributors working on Zambian content into distinct knowledge level categories (e.g., Novice, Intermediate, Advanced, Expert) based on their contribution patterns, edit quality, and content expertise.

### Specific Technical Objectives

### Feature Engineering
Extract meaningful features from contributor data including:
- Edit frequency and consistency patterns  
- Quality indicators (citations added, grammar improvements, factual accuracy)  
- Content complexity (technical terms, specialized knowledge areas)  
- Collaboration patterns (talk page participation, peer interactions)  
- Longevity and commitment indicators  

### Classification Model Development
- Implement and compare multiple classification algorithms (Random Forest, SVM, Neural Networks, etc.)  
- Develop ensemble methods to improve classification accuracy  
- Create interpretable models that explain why a contributor is classified at a particular level  
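
One way to compare candidate algorithms is to score each with the same stratified cross-validation splits. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data with three made-up classes and four made-up features; it is not the project's feature set or model configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 3 classes, 4 features, 40 samples per class.
rng = np.random.default_rng(0)
n_per_class = 40
X = np.vstack([rng.normal(loc=2.0 * c, scale=1.0, size=(n_per_class, 4)) for c in range(3)])
y = np.repeat(np.arange(3), n_per_class)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVMs are scale-sensitive, so the scaler is fitted inside each fold.
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_weighted")
    # Store the mean score and the spread between the best and worst fold.
    results[name] = (scores.mean(), scores.max() - scores.min())
    print(f"{name}: mean weighted F1 = {results[name][0]:.3f}, fold spread = {results[name][1]:.3f}")
```

The fold spread gives a rough view of the cross-validation stability target in Section 1.4.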

### Clustering Analysis (Secondary)
- Identify natural groupings of contributors based on their expertise areas and contribution styles  
- Discover contributor archetypes that may not align with traditional knowledge level categories  

### Predictive Modeling
- Predict potential contributor knowledge growth trajectories  
- Identify contributors likely to become long-term, high-quality contributors

# 2. Data Understanding

### 2.1 Dataset Description  

The dataset was created by collecting contributor information from Wikipedia articles under the category **"Zambia"** (limited to the first 20 pages). Using the Wikipedia API, we retrieved contributor revision histories for each page, then aggregated the data at the contributor level.  

Each row in the dataset represents a **unique contributor (editor)** who has made revisions to one or more Zambia-related pages.  
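
The aggregation step can be sketched as follows. The collection code is not shown in this notebook, so this is an assumption about its shape: each revision record is taken to be a dict with `page`, `user`, `timestamp`, and `size` keys (the last three correspond to `rvprop` values in the MediaWiki revisions API), and the function name and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_by_contributor(revisions):
    """Collapse per-revision records into one summary row per contributor."""
    grouped = defaultdict(list)
    for rev in revisions:
        grouped[rev["user"]].append(rev)

    rows = []
    for user, revs in grouped.items():
        # MediaWiki timestamps end in "Z"; normalise them for fromisoformat.
        stamps = [datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00")) for r in revs]
        rows.append({
            "Contributor": user,
            "Total_Edits": len(revs),
            "Pages_Edited": len({r["page"] for r in revs}),
            "Total_Size": sum(r["size"] for r in revs),
            "First_Edit": min(stamps),
            "Last_Edit": max(stamps),
        })
    return rows


# Hypothetical revision records, for illustration only.
sample_revisions = [
    {"page": "Zambia", "user": "EditorA", "timestamp": "2025-01-10T09:00:00Z", "size": 1500},
    {"page": "Lusaka", "user": "EditorA", "timestamp": "2025-03-02T14:30:00Z", "size": 800},
    {"page": "Zambia", "user": "EditorB", "timestamp": "2025-02-01T08:15:00Z", "size": 1520},
]
summary = {row["Contributor"]: row for row in aggregate_by_contributor(sample_revisions)}
print(summary["EditorA"]["Total_Edits"], summary["EditorA"]["Pages_Edited"])
```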

### 2.2 Features  

The dataset contains the following columns:  

- **Contributor**: The username or IP address of the editor.  
- **Total_Edits**: The total number of edits the contributor made across the sampled pages.  
- **Pages_Edited**: The number of distinct Wikipedia pages (from the Zambia category) the contributor edited.  
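- **Total_Size**: The cumulative size (in bytes) of the revisions saved by the contributor. The sample values (about 152 KB per edit) suggest this sums the full page size after each edit rather than the number of bytes the contributor added.  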
- **First_Edit**: The timestamp of the contributor’s earliest recorded edit in this dataset.  
- **Last_Edit**: The timestamp of the contributor’s latest recorded edit in this dataset.  
- **Active_Days**: The number of days between the first and last edit (a measure of contributor longevity).  
- **Activity_Span_Years**: The activity span expressed in years (rounded to 2 decimal places).  
- **Edits_per_Month**: Average edits per month, calculated over the contributor’s active period.  
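
The three derived columns can be reproduced from `Total_Edits`, `First_Edit`, and `Last_Edit`. The exact formulas are not given in this notebook; the sketch below uses one set of assumptions that is consistent with the sample rows (365-day years, 30-day months, and an active period floored at one month), and the function name is hypothetical.

```python
from datetime import datetime


def derive_activity_metrics(total_edits, first_edit, last_edit):
    """Compute Active_Days, Activity_Span_Years and Edits_per_Month for one contributor."""
    first = datetime.fromisoformat(first_edit)
    last = datetime.fromisoformat(last_edit)
    active_days = (last - first).days  # whole days between first and last edit
    span_years = round(active_days / 365, 2)
    # Assumption: 30-day months, with the period floored at one month so that
    # single-edit contributors get a rate equal to their edit count.
    months = max(active_days / 30, 1)
    return {
        "Active_Days": active_days,
        "Activity_Span_Years": span_years,
        "Edits_per_Month": round(total_edits / months, 2),
    }


# Timestamps copied from the sample rows for Remsense and Tumbuka Arch.
remsense = derive_activity_metrics(5, "2025-01-23 16:22:18+00:00", "2025-08-10 00:53:36+00:00")
tumbuka = derive_activity_metrics(2, "2025-05-21 15:53:08+00:00", "2025-07-15 20:12:05+00:00")
single = derive_activity_metrics(1, "2025-07-14 23:43:14+00:00", "2025-07-14 23:43:14+00:00")
print(remsense, tumbuka, single, sep="\n")
```

Other formulas could match the same rows, for example dividing by the active period only when it is non-zero, so the original collection code should be treated as the reference.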

### 2.3 Dataset Characteristics  

- **Unit of Analysis**: Individual contributors.  
- **Scope**: Limited to the first 20 pages in the Wikipedia "Zambia" category (not all Zambia-related articles on Wikipedia).  
- **Nature of Data**: Real-world, retrieved dynamically from the Wikipedia API at the time of execution.  
- **Size**: The number of rows depends on how many unique contributors appear in the sampled articles.  

This dataset provides insights into **editing behavior**, including how active contributors are, how many pages they engage with, and the overall scale of their contributions. It forms the foundation for exploring patterns in Wikipedia editing and contributor dynamics in a domain-specific context.  


In [None]:
!pip install requests pandas




In [None]:
import pandas as pd
df = pd.read_csv("zambia_contributors_summary.csv")
# 1. Display first few rows
print("First 10 rows:")
display(df.head(10))



First 10 rows:


| Unnamed: 0 | Contributor | Total_Edits | Pages_Edited | Total_Size | First_Edit | Last_Edit | Active_Days | Activity_Span_Years | Edits_per_Month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Kosmopolitní | 3 | 1 | 455820 | 2025-03-21 08:05:23+00:00 | 2025-08-13 06:32:53+00:00 | 144 | 0.39 | 0.62 |
| 1 | 102.208.220.201 | 1 | 1 | 151954 | 2025-08-11 15:35:23+00:00 | 2025-08-11 15:35:23+00:00 | 0 | 0.0 | 1.0 |
| 2 | Guest88 | 1 | 1 | 151954 | 2025-08-11 13:18:40+00:00 | 2025-08-11 13:18:40+00:00 | 0 | 0.0 | 1.0 |
| 3 | Remsense | 5 | 1 | 755201 | 2025-01-23 16:22:18+00:00 | 2025-08-10 00:53:36+00:00 | 198 | 0.54 | 0.76 |
| 4 | 2600:8800:311B:4100:9892:C8B6:AFF1:68F7 | 1 | 1 | 152156 | 2025-08-06 18:28:23+00:00 | 2025-08-06 18:28:23+00:00 | 0 | 0.0 | 1.0 |
| 5 | 2001:4479:900A:2F00:78FD:7EE:28B7:95FA | 1 | 1 | 152155 | 2025-08-04 23:49:33+00:00 | 2025-08-04 23:49:33+00:00 | 0 | 0.0 | 1.0 |
| 6 | Tumbuka Arch | 2 | 1 | 304021 | 2025-05-21 15:53:08+00:00 | 2025-07-15 20:12:05+00:00 | 55 | 0.15 | 1.09 |
| 7 | Lightoil | 1 | 1 | 151978 | 2025-07-14 23:43:14+00:00 | 2025-07-14 23:43:14+00:00 | 0 | 0.0 | 1.0 |
| 8 | GeographicAccountant | 1 | 1 | 151990 | 2025-07-14 20:10:30+00:00 | 2025-07-14 20:10:30+00:00 | 0 | 0.0 | 1.0 |
| 9 | 102.149.117.245 | 1 | 1 | 151996 | 2025-07-14 18:20:23+00:00 | 2025-07-14 18:20:23+00:00 | 0 | 0.0 | 1.0 |


### 2.4 First Look at the Data (.head())
Calling .head(10) displays the first 10 rows of the dataset. This gives a quick preview of how the data is structured, including column names and sample values. From the output above, we can observe:

- The dataset contains numerical columns (e.g., Total_Edits, Total_Size, Active_Days) and text columns (Contributor, First_Edit, Last_Edit).  
- Contributors include both registered usernames and anonymous IP addresses (IPv4 and IPv6).  
- No missing values (NaN) appear in the preview, but the full dataset still needs to be checked.  
- The *Unnamed: 0* column appears to be the row index saved with the CSV and carries no information about contributors.  
- Seven of the ten previewed contributors made a single edit, so their Active_Days is 0 and their Edits_per_Month is 1.0.  

---

### 2.5 Dataset Dimensions (.shape)
The .shape attribute returns the dataset size as (rows, columns):

- *Rows* (records/observations): one per unique contributor, so the count depends on how many contributors appear in the sampled pages.  
- *Columns* (features/attributes): 10 in the preview above, i.e. the nine features listed in Section 2.2 plus the *Unnamed: 0* index column.  

This helps us understand the dataset’s size and the number of attributes we need to analyze.

---

### 2.6 Data Structure and Types (.info())
The .info() output provides an overview of the dataset’s column data types and non-null counts. Based on the preview, we expect that:

- Total_Edits, Pages_Edited, Total_Size, and Active_Days are *integers*, while Activity_Span_Years and Edits_per_Month are *floats*.  
- Contributor is an *object/string* column. First_Edit and Last_Edit are also read as *object* unless they are parsed as datetimes when loading the CSV.  
- Any column whose non-null count is below the row count contains *missing values*, which may require imputation or removal later.  
- Further preprocessing may be needed to standardize data types, in particular converting the two timestamp columns.  

---

### 2.7 Summary Statistics (.describe())
The .describe() method summarizes numerical columns with measures such as:

- *Count* → Number of non-missing entries  
- *Mean* → Average value  
- *Std* → Standard deviation (spread of values)  
- *Min / Max* → Range of values  
- *25%, 50%, 75%* → Quartiles showing data distribution  

When reading the summary statistics, we should check for the following:

- Columns such as Total_Edits and Total_Size may have very large ranges, possibly indicating *outliers* (a few highly active contributors).  
- Activity attributes are likely *skewed*, since most previewed contributors made only one edit.  
- Differences in scales (e.g., Total_Size in the hundreds of thousands vs. Edits_per_Month near 1) suggest we may need *normalization/standardization* later in preprocessing.
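
As a quick illustration of the outlier check, the sketch below applies the common 1.5 × IQR rule to the Total_Edits values from the 10-row preview. It uses only the standard library; the function name is hypothetical, and quartile conventions differ between libraries (for example, pandas interpolates differently from the default method of `statistics.quantiles`), so the fences may not match those from other tools.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" quartile method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Total_Edits for the ten contributors shown in the preview.
total_edits_preview = [3, 1, 1, 5, 1, 1, 2, 1, 1, 1]
outliers = iqr_outliers(total_edits_preview)
print(outliers)
```

With so many single-edit contributors, even a handful of edits is flagged, which is another sign of how skewed the activity columns are.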

In [None]:

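import pandas as pd

# Load the contributor summary; the CSV must be in the working directory.
df = pd.read_csv("zambia_contributors_summary.csv")

# Preview the first rows.
print("First 10 rows:")
display(df.head(10))

# Column data types and non-null counts.
print("\nDataFrame Info:")
df.info()

# Summary statistics for the numerical columns.
print("\nDescriptive Statistics:")
display(df.describe())

# Number of rows and columns.
print("\nDataset shape:", df.shape)

# Missing values in each column.
print("\nMissing values per column:")
display(df.isnull().sum())

# Each row should correspond to one unique contributor.
print("\nNumber of unique contributors:", df['Contributor'].nunique())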

FileNotFoundError: [Errno 2] No such file or directory: 'zambia_contributors_summary.csv'