In [1]:
import pandas as pd
import re

# Data Cleaning for Challenge 2 - Seasonality
In this notebook I clean the raw dataset, keeping only the rows that:
- relate to questions, and
- are in English.

The initial dataset has 20304843 rows, each containing one question-answer pair. Three dataframes are exported at the end of this notebook:
- `EN_questions` (held in the variable `q_df` throughout the notebook) contains 2218481 rows. It includes all English-language questions from Kenya and Uganda.
- `ke_EN_questions` contains 1290839 rows. It includes the English-language questions from Kenya.
- `ug_EN_questions` contains 927642 rows. It includes the English-language questions from Uganda.

All three datasets contain the following 7 columns:
- question_id
- user_id
- country
- topics
- text
- clean_text
- date

In [2]:
df = pd.read_csv("data/raw/pd_dataset.csv")

In [3]:
df.columns

Index(['question_id', 'question_user_id', 'question_language',
       'question_content', 'question_topic', 'question_sent', 'response_id',
       'response_user_id', 'response_language', 'response_content',
       'response_topic', 'response_sent', 'question_user_type',
       'question_user_status', 'question_user_country_code',
       'question_user_gender', 'question_user_dob', 'question_user_created_at',
       'response_user_type', 'response_user_status',
       'response_user_country_code', 'response_user_gender',
       'response_user_dob', 'response_user_created_at'],
      dtype='object')

In [4]:
df.shape

(20304843, 24)

## Create questions dataframe

I'm keeping only the question columns that are relevant to the analysis. \
All answer columns and the following question columns are excluded:
- question_user_type (all users are farmers)
- question_user_status
- question_user_created_at
- question_language (removed after choosing to keep only questions in English)

In [5]:
q_columns = ['question_id', 'question_user_id', 'question_language',
             'question_content', 'question_topic', 'question_sent', 
             'question_user_country_code', 'question_user_gender', 
             'question_user_dob']

# Keep English-language rows and the selected question columns in one step
q_df = df.loc[df['question_language'] == 'eng', q_columns].copy()

q_df.drop(columns=['question_language'], inplace=True)
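
Since only the question columns are used downstream, they could also be selected while reading the file, which avoids holding the answer columns in memory. The sketch below is a minimal illustration of that idea using `usecols`; the two CSV rows are made-up stand-ins for `data/raw/pd_dataset.csv`, not real data.

```python
import io
import pandas as pd

# Columns the notebook keeps for the questions dataframe.
q_columns = ['question_id', 'question_user_id', 'question_language',
             'question_content', 'question_topic', 'question_sent',
             'question_user_country_code', 'question_user_gender',
             'question_user_dob']

# Hypothetical two-row stand-in for the raw CSV, with one response column
# to show that usecols leaves it out.
raw = io.StringIO(
    "question_id,question_user_id,question_language,question_content,"
    "question_topic,question_sent,question_user_country_code,"
    "question_user_gender,question_user_dob,response_id\n"
    "1,10,eng,Q how do I plant maize?,maize,2017-11-22 12:25:05+00,ke,,,100\n"
    "2,11,swa,Q mfano,bean,2017-11-22 12:25:06+00,tz,,,101\n"
)

# Only the listed columns are parsed; response_id is never loaded.
sample = pd.read_csv(raw, usecols=q_columns)

# Same English filter as in the notebook, applied to the reduced frame.
en_sample = sample.loc[sample['question_language'] == 'eng', q_columns].copy()
print(en_sample.shape)
```

The Tanzania language check later in the notebook only needs `question_user_country_code` and `question_language`, both of which are in `q_columns`, so it would still work on a frame loaded this way.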

In [6]:
q_df.head()

Unnamed: 0,question_id,question_user_id,question_content,question_topic,question_sent,question_user_country_code,question_user_gender,question_user_dob
1,3849061,521327,Q this goes to wefarm. is it possible to get f...,,2017-11-22 12:25:05+00,ug,,
9,3849084,6642,Q-i have stock rabbit's urine for 5 weeks mash...,rabbit,2017-11-22 12:25:10+00,ke,,
15,3849098,526375,Q J Have Mi 10000 Can J Start Aproject Of Pout...,poultry,2017-11-22 12:25:12+00,ug,,
16,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,pig,2017-11-22 12:25:12+00,ke,,
17,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,coconut,2017-11-22 12:25:12+00,ke,,


In [7]:
q_df.shape

(11976781, 8)

## Clean null values

In [8]:
q_df.isna().sum()

question_id                          0
question_user_id                     0
question_content                     0
question_topic                 1605642
question_sent                        0
question_user_country_code           0
question_user_gender          11497292
question_user_dob             11085031
dtype: int64

- Since most users' gender and date of birth values are null, I'll remove these columns entirely.
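
To make "most" concrete, the null counts above can be expressed as a share of the 11976781 rows. This is a small sketch that reuses the printed counts; on the real dataframe, `q_df.isna().mean()` would give the same fractions directly.

```python
import pandas as pd

total_rows = 11976781  # q_df.shape[0] before any rows or columns are dropped

# Null counts copied from the q_df.isna().sum() output above.
null_counts = pd.Series({
    'question_topic': 1605642,
    'question_user_gender': 11497292,
    'question_user_dob': 11085031,
})

# Percentage of rows with a missing value, rounded to one decimal place.
null_share = (null_counts / total_rows * 100).round(1)
print(null_share)
```

Gender is missing in roughly 96% of rows and date of birth in roughly 93%, while the topic is missing in about 13%.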

In [9]:
q_df.drop(columns=['question_user_gender', 'question_user_dob'], inplace=True)

- I'm also removing rows where the topic is missing.

In [10]:
q_df.dropna(subset=['question_topic'], inplace=True)

In [11]:
q_df.shape

(10371139, 6)

## Clean duplicates

The original dataset contains a question-answer pair per row, so our questions dataframe now has many duplicate questions. Let's remove all exact duplicates:

In [12]:
q_df.duplicated().sum()

7708246

In [13]:
q_df.drop_duplicates(inplace=True)

In [14]:
q_df.shape

(2662893, 6)
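
As a quick consistency check, the row counts reported so far should differ by exactly the number of rows each step removed. The sketch below only redoes that arithmetic with the figures printed above; the step labels are mine.

```python
# Row counts taken from the q_df.shape outputs above.
steps = [
    ("English questions", 11976781),
    ("after dropping rows without a topic", 10371139),
    ("after dropping exact duplicates", 2662893),
]

# Rows removed between each pair of consecutive steps.
removed = [
    (steps[i + 1][0], steps[i][1] - steps[i + 1][1])
    for i in range(len(steps) - 1)
]

for label, count in removed:
    print(f"{label}: {count} rows removed")
```

The two differences match the 1605642 missing topics and the 7708246 duplicates reported earlier.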

## Group topics

We can see that, even after removing duplicate rows, some questions appear more than once. That is because some questions have more than one topic. \
E.g. question with `question_id` 3849100 appears twice, once with `question_topic` "pig", and then "coconut".

In [15]:
q_df[q_df['question_id']==3849100]

Unnamed: 0,question_id,question_user_id,question_content,question_topic,question_sent,question_user_country_code
16,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,pig,2017-11-22 12:25:12+00,ke
17,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,coconut,2017-11-22 12:25:12+00,ke


In [16]:
print(f"The original dataset has {q_df['question_topic'].nunique()} topics, including:\n\n{list(q_df['question_topic'].unique())}")

The original dataset has 148 topics, including:

['rabbit', 'poultry', 'pig', 'coconut', 'plant', 'tomato', 'animal', 'potato', 'coffee', 'onion', 'chicken', 'tree', 'cattle', 'cassava', 'pigeon', 'banana', 'kale', 'passion-fruit', 'wheat', 'maize', 'cabbage', 'crop', 'spinach', 'turkey', 'rice', 'bean', 'paw-paw', 'sheep', 'butternut-squash', 'livestock', 'greens', 'pumpkin', 'watermelon', 'plantain', 'olive', 'vegetable', 'tobacco', 'sugar-cane', 'avocado', 'goat', 'sweet-potato', 'beetroot', 'bee', 'capsicum', 'grass', 'mango', 'macademia', 'millet', 'melon', 'pear', 'jackfruit', 'dog', 'cowpea', 'nightshade', 'bird', 'cotton', 'flax', 'apple', 'cocoa', 'pineapple', 'garlic', 'duck', 'sunflower', 'cereal', 'orange', 'miraa', 'carrot', 'guava', 'tea', 'fish', 'tilapia', 'safflower', 'napier-grass', 'peanut', 'cat', 'collard-greens', 'french-bean', 'lettuce', 'aubergine', 'yam', 'oat', 'soya', 'mung-bean', 'clover', 'strawberry', 'pea', 'rapeseed', 'radish', 'taro', 'cucumber', 'eucal

I'm creating a new column that holds all topics for a question, so that each question appears in only one row.

In [17]:
df_topics = (q_df.groupby('question_id')['question_topic']
               .apply(lambda x: tuple(set(x.dropna())))
               .reset_index())

q_df = q_df.drop_duplicates('question_id').drop(columns=['question_topic']).merge(df_topics, on='question_id')

In [18]:
q_df.head()

Unnamed: 0,question_id,question_user_id,question_content,question_sent,question_user_country_code,question_topic
0,3849084,6642,Q-i have stock rabbit's urine for 5 weeks mash...,2017-11-22 12:25:10+00,ke,"(rabbit,)"
1,3849098,526375,Q J Have Mi 10000 Can J Start Aproject Of Pout...,2017-11-22 12:25:12+00,ug,"(poultry,)"
2,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,2017-11-22 12:25:12+00,ke,"(pig, coconut)"
3,3849129,54426,Q#.Which plant has omega3?,2017-11-22 12:25:16+00,ke,"(plant,)"
4,3849153,340091,Q Am Jackson From Ibanda If Want To Grow Tomat...,2017-11-22 12:25:18+00,ug,"(tomato,)"


In [19]:
print(f"There are now {q_df['question_topic'].nunique()} combinations of topics in the question_topic column")

There are now 14454 combinations of topics in the question_topic column


In [20]:
for topic_group in q_df['question_topic'].unique()[:15]:
    print(topic_group)

('rabbit',)
('poultry',)
('pig', 'coconut')
('plant',)
('tomato',)
('animal', 'potato')
('coffee',)
('coffee', 'plant')
('onion',)
('chicken',)
('pig',)
('tree',)
('cattle',)
('pigeon', 'cassava')
('banana',)
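
One caveat with `tuple(set(...))`: Python does not guarantee the iteration order of a set, so the same group of topics could in principle end up as two different tuples and be counted as separate combinations. A possible safeguard, shown here on a toy dataframe with made-up question ids, is to sort the topics before building the tuple. This is a sketch of an alternative, not what the notebook ran.

```python
import pandas as pd

# Toy data: questions 1 and 2 have the same two topics in a different row order.
toy = pd.DataFrame({
    'question_id': [1, 1, 2, 2, 3],
    'question_topic': ['pig', 'coconut', 'coconut', 'pig', 'rabbit'],
})

# Sorting makes the tuple independent of row order and of set iteration order.
topics = (toy.groupby('question_id')['question_topic']
             .apply(lambda x: tuple(sorted(set(x))))
             .reset_index())

print(topics)
print(topics['question_topic'].nunique())
```

With sorted tuples, questions 1 and 2 share the single combination `('coconut', 'pig')`.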


## Grouping question types with NLP

The topics in the original dataset identify specific animals, plants, etc., but they don't tell us whether a question relates to season-affected processes or issues such as planting, pests, sickness, harvesting, or selling. NLP could help identify these question types from the question text. To make NLP possible in later steps, I'm creating a column with cleaned text.

In [21]:
def clean_text(text):
    """Lowercase the text, remove URLs, keep only letters and spaces,
    and normalize whitespace."""
    text = str(text).lower()
    # r"http\S+" already matches https URLs, so no separate alternative is needed
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

q_df['clean_text'] = q_df['question_content'].apply(clean_text)
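
To show what the cleaning does, here it is applied to two questions from the sample rows above. Because non-letter characters are deleted rather than replaced, words separated only by punctuation get merged ("Q#.Which" becomes "qwhich"). The second function, `clean_text_spaced`, is a hypothetical variant that substitutes a space instead; it is only a sketch of one possible alternative.

```python
import re

def clean_text(text):
    """Same behaviour as the notebook's clean_text."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def clean_text_spaced(text):
    """Hypothetical variant: URLs and non-letters become spaces, not nothing."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("WHERE DO I GET SEEDS OF COCONUT?"))   # where do i get seeds of coconut
print(clean_text("Q#.Which plant has omega3?"))          # qwhich plant has omega
print(clean_text_spaced("Q#.Which plant has omega3?"))   # q which plant has omega
```

Which behaviour is preferable depends on the later NLP step, for example whether the leading "q" marker is stripped as a separate token.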

## Rename and reorder columns

In [22]:
q_df = q_df.rename(columns={
    'question_user_id': 'user_id',
    'question_content': 'text',
    'question_sent': 'date',
    'question_user_country_code': 'country',
    'question_topic': 'topics',
})

In [23]:
q_df = q_df[['question_id', 'user_id', 'country', 'topics', 'text', 'clean_text', 'date']]

## Filter countries and create dataframes per country

In [25]:
q_df['country'].value_counts()

country
ke    1290839
ug     927642
gb         61
tz          2
Name: count, dtype: int64

As the counts show, almost no questions from Tanzania remain after cleaning. This is a consequence of the English-only filter: nearly all questions from Tanzania in the original dataset are in Swahili (`swa`), not English:

In [26]:
print(f"""Tanzania language value counts in the original dataset:
{df[df['question_user_country_code']=='tz']['question_language'].value_counts()}""")

Tanzania language value counts in the original dataset:
question_language
swa    4233714
eng         12
Name: count, dtype: int64


I will keep only Kenya and Uganda, which drops the 63 remaining rows from `gb` and `tz`.

In [27]:
q_df = q_df[q_df['country'].isin(['ke', 'ug'])]

In [28]:
q_df_ke = q_df[q_df['country']=='ke']
q_df_ug = q_df[q_df['country']=='ug']

The final shape of the dataframes:

In [34]:
q_df.shape

(2218481, 7)

In [35]:
q_df_ke.shape

(1290839, 7)

In [36]:
q_df_ug.shape

(927642, 7)

## Export dataframes

In [37]:
# Dataframe with all questions in English from Kenya and Uganda together
q_df.to_csv("data/clean/EN_questions.csv", index=False)

In [38]:
# Dataframe with all questions in English from Kenya
q_df_ke.to_csv("data/clean/ke_EN_questions.csv", index=False)

In [39]:
# Dataframe with all questions in English from Uganda
q_df_ug.to_csv("data/clean/ug_EN_questions.csv", index=False)
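
A note for whoever loads these files later: CSV has no tuple type, so the `topics` column is written as text such as `"('pig', 'coconut')"` and comes back as a plain string. One way to restore the tuples, sketched here on a two-row toy frame and an in-memory buffer instead of the real files, is to pass `ast.literal_eval` as a converter to `pd.read_csv`.

```python
import ast
import io
import pandas as pd

# Toy frame mirroring the question_id and topics columns of the exported data.
toy = pd.DataFrame({
    'question_id': [3849084, 3849100],
    'topics': [('rabbit',), ('pig', 'coconut')],
})

# Write to an in-memory buffer instead of data/clean/*.csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: topics come back as strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With a converter: each string is parsed back into a tuple.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'topics': ast.literal_eval})

print(type(as_text['topics'][0]), type(restored['topics'][0]))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this purpose.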