In [1]:
import pandas as pd
import re

# Data Cleaning for Challenge 2 - Seasonality
In this notebook I clean the data to keep only rows that:
- relate to questions, and
- are in English.

The initial dataset has 20,304,843 rows, each containing one question-answer pair. Three dataframes are exported at the end of this notebook:
- `EN_questions` (held in the variable `q_df` throughout the notebook) contains 2,218,481 rows. It includes all English-language questions from Kenya and Uganda.
- `ke_EN_questions` contains 1,290,839 rows. It includes all English-language questions from Kenya.
- `ug_EN_questions` contains 927,642 rows. It includes all English-language questions from Uganda.

All three datasets contain the following 8 columns:
- question_id
- user_id
- country
- topics
- text
- clean_text
- date
- season

In [2]:
df = pd.read_csv("data/raw/pd_dataset.csv")

In [3]:
df.columns

Index(['question_id', 'question_user_id', 'question_language',
       'question_content', 'question_topic', 'question_sent', 'response_id',
       'response_user_id', 'response_language', 'response_content',
       'response_topic', 'response_sent', 'question_user_type',
       'question_user_status', 'question_user_country_code',
       'question_user_gender', 'question_user_dob', 'question_user_created_at',
       'response_user_type', 'response_user_status',
       'response_user_country_code', 'response_user_gender',
       'response_user_dob', 'response_user_created_at'],
      dtype='object')

In [4]:
df.shape

(20304843, 24)

## Create questions dataframe

I'm choosing only question columns containing relevant data for the analysis. \
All answer columns and the following question columns have been excluded:
- question_user_type (all users are farmers)
- question_user_status
- question_user_created_at
- question_language (removed after choosing to keep only questions in English)

In [5]:
q_columns = ['question_id', 'question_user_id', 'question_language',
             'question_content', 'question_topic', 'question_sent', 
             'question_user_country_code', 'question_user_gender', 
             'question_user_dob']

# Keep only English-language rows and the selected question columns
q_df = df.loc[df['question_language'] == 'eng', q_columns].copy()

q_df.drop(columns=['question_language'], inplace=True)

In [6]:
q_df.head()

Unnamed: 0,question_id,question_user_id,question_content,question_topic,question_sent,question_user_country_code,question_user_gender,question_user_dob
1,3849061,521327,Q this goes to wefarm. is it possible to get f...,,2017-11-22 12:25:05+00,ug,,
9,3849084,6642,Q-i have stock rabbit's urine for 5 weeks mash...,rabbit,2017-11-22 12:25:10+00,ke,,
15,3849098,526375,Q J Have Mi 10000 Can J Start Aproject Of Pout...,poultry,2017-11-22 12:25:12+00,ug,,
16,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,pig,2017-11-22 12:25:12+00,ke,,
17,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,coconut,2017-11-22 12:25:12+00,ke,,


In [7]:
q_df.shape

(11976781, 8)

## Clean null values

In [8]:
q_df.isna().sum()

question_id                          0
question_user_id                     0
question_content                     0
question_topic                 1605642
question_sent                        0
question_user_country_code           0
question_user_gender          11497292
question_user_dob             11085031
dtype: int64

- Since most users' gender and date-of-birth values are null, I'll remove these columns entirely.

In [9]:
q_df.drop(columns=['question_user_gender', 'question_user_dob'], inplace=True)

- I'll also remove rows where the topic is missing.

In [10]:
q_df.dropna(subset=['question_topic'], inplace=True)

In [11]:
q_df.shape

(10371139, 6)

## Clean duplicates

The original dataset contains a question-answer pair per row, so our questions dataframe now has many duplicate questions. Let's remove all exact duplicates:

In [12]:
q_df.duplicated().sum()

7708246

In [13]:
q_df.drop_duplicates(inplace=True)

In [14]:
q_df.shape

(2662893, 6)

## Group topics

We can see that, even after removing duplicate rows, some questions appear more than once. That is because some questions have more than one topic. \
E.g. question with `question_id` 3849100 appears twice, once with `question_topic` "pig", and then "coconut".

In [15]:
q_df[q_df['question_id']==3849100]

Unnamed: 0,question_id,question_user_id,question_content,question_topic,question_sent,question_user_country_code
16,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,pig,2017-11-22 12:25:12+00,ke
17,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,coconut,2017-11-22 12:25:12+00,ke


In [16]:
print(f"The questions dataframe has {q_df['question_topic'].nunique()} topics, including:\n\n{list(q_df['question_topic'].unique())}")

The questions dataframe has 148 topics, including:

['rabbit', 'poultry', 'pig', 'coconut', 'plant', 'tomato', 'animal', 'potato', 'coffee', 'onion', 'chicken', 'tree', 'cattle', 'cassava', 'pigeon', 'banana', 'kale', 'passion-fruit', 'wheat', 'maize', 'cabbage', 'crop', 'spinach', 'turkey', 'rice', 'bean', 'paw-paw', 'sheep', 'butternut-squash', 'livestock', 'greens', 'pumpkin', 'watermelon', 'plantain', 'olive', 'vegetable', 'tobacco', 'sugar-cane', 'avocado', 'goat', 'sweet-potato', 'beetroot', 'bee', 'capsicum', 'grass', 'mango', 'macademia', 'millet', 'melon', 'pear', 'jackfruit', 'dog', 'cowpea', 'nightshade', 'bird', 'cotton', 'flax', 'apple', 'cocoa', 'pineapple', 'garlic', 'duck', 'sunflower', 'cereal', 'orange', 'miraa', 'carrot', 'guava', 'tea', 'fish', 'tilapia', 'safflower', 'napier-grass', 'peanut', 'cat', 'collard-greens', 'french-bean', 'lettuce', 'aubergine', 'yam', 'oat', 'soya', 'mung-bean', 'clover', 'strawberry', 'pea', 'rapeseed', 'radish', 'taro', 'cucumber', 'eucal

I'm creating a new column that holds all topics for a question, so each question appears in only one row.

In [17]:
df_topics = (q_df.groupby('question_id')['question_topic']
               .apply(lambda x: tuple(set(x.dropna())))
               .reset_index())

q_df = q_df.drop_duplicates('question_id').drop(columns=['question_topic']).merge(df_topics, on='question_id')
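
Because `set` iteration order is not guaranteed, the same pair of topics could in principle come out as `('pig', 'coconut')` for one question and `('coconut', 'pig')` for another, which would inflate the number of distinct combinations. A minimal sketch of a variant that sorts the topics first, shown on a toy dataframe (the `toy` rows are illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in for q_df: one question with two topics, one with a single topic
toy = pd.DataFrame({
    "question_id": [3849100, 3849100, 3849084],
    "question_topic": ["pig", "coconut", "rabbit"],
})

# Sorting the unique topics gives every combination one canonical tuple
topics = (
    toy.groupby("question_id")["question_topic"]
       .apply(lambda x: tuple(sorted(set(x.dropna()))))
       .reset_index()
)
print(topics)
```

With sorted tuples, counting distinct values of the topics column counts topic combinations regardless of the order in which the topics were encountered.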

In [18]:
q_df.head()

Unnamed: 0,question_id,question_user_id,question_content,question_sent,question_user_country_code,question_topic
0,3849084,6642,Q-i have stock rabbit's urine for 5 weeks mash...,2017-11-22 12:25:10+00,ke,"(rabbit,)"
1,3849098,526375,Q J Have Mi 10000 Can J Start Aproject Of Pout...,2017-11-22 12:25:12+00,ug,"(poultry,)"
2,3849100,237506,WHERE DO I GET SEEDS OF COCONUT?,2017-11-22 12:25:12+00,ke,"(pig, coconut)"
3,3849129,54426,Q#.Which plant has omega3?,2017-11-22 12:25:16+00,ke,"(plant,)"
4,3849153,340091,Q Am Jackson From Ibanda If Want To Grow Tomat...,2017-11-22 12:25:18+00,ug,"(tomato,)"


In [19]:
print(f"There are now {q_df['question_topic'].nunique()} combinations of topics in the question_topic column")

There are now 14464 combinations of topics in the question_topic column


In [20]:
for topic_group in q_df['question_topic'].unique()[:15]:
    print(topic_group)

('rabbit',)
('poultry',)
('pig', 'coconut')
('plant',)
('tomato',)
('potato', 'animal')
('coffee',)
('plant', 'coffee')
('onion',)
('chicken',)
('pig',)
('tree',)
('cattle',)
('pigeon', 'cassava')
('banana',)


## Generating question types column

In [21]:
def clean_text(text):
    """Lowercases the text, removes URLs, keeps only letters
    and spaces, and normalizes whitespace."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

q_df['clean_text'] = q_df['question_content'].apply(clean_text)
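
To make the effect of `clean_text` concrete, here is a small check on one question from the sample above plus a made-up string (the second input and its URL are invented for illustration). Note that digits and hyphens are removed, so `90kg` becomes `kg` and `paw-paw` becomes `pawpaw`:

```python
import re

def clean_text(text):
    """Same steps as the cell above: lowercase, strip URLs, keep letters and spaces."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Q#.Which plant has omega3?"))
# qwhich plant has omega
print(clean_text("Price of a 90kg bag of paw-paw? See www.example.com"))
# price of a kg bag of pawpaw see
```

This matters for any later keyword matching on `clean_text`: keywords that contain digits or hyphens cannot appear in the cleaned string.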

In [41]:
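def define_question_types(question_text):
    """Return the first category that has a keyword occurring as a
    substring of question_text, or 'other' if none matches."""
    # Categories dictionary generated with DeepSeek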
    categories = {
    'planting': [
        'plant', 'seed', 'sow', 'germinate', 'transplant', 'seedling', 
        'germination', 'sowing', 'planting', 'propagate', 'propagation',
        'nursery', 'sapling', 'seedbed', 'direct seed', 'broadcast',
        'row planting', 'spacing', 'plant density', 'intercrop', 'intercropping',
        'companion plant', 'transplanting', 'seed treatment', 'pre-germinate',
        'bed preparation', 'land prep', 'till', 'tillage', 'plough', 'plow',
        'ridge', 'furrow', 'dibble', 'dibbling', 'hole', 'drill', 'drilling',
        'plant hole', 'seed rate', 'planting time', 'planting date',
        'when to plant', 'planting season', 'best time to plant',
        'how many months', 'how long', 'take to grow', 'time to harvest',
        'duration', 'maturity period', 'days to', 'weeks to', 'months to',
        'growth period', 'life cycle', 'cycle', 'timeline',
        'when ready', 'ready when', 'harvest time', 'flowering time',
        'how soon', 'quick growing', 'fast growing', 'slow growing',
        'early maturing', 'late maturing', 'growth rate',
        'variety', 'varieties', 'different types', 'how many varieties',
        'which variety', 'best variety', 'good variety',
        'cassava', 'banana', 'bean', 'potato', 'tomato', 'maize', 'rice', 
        'wheat', 'sorghum', 'millet', 'coconut', 'coffee', 'onion',
        'kale', 'passion', 'passionfruit', 'cabbage', 'spinach', 'pawpaw',
        'paw-paw', 'butternut', 'squash', 'greens', 'pumpkin', 'watermelon',
        'plantain', 'olive', 'vegetable', 'tobacco', 'sugarcane', 'sugar-cane',
        'avocado', 'sweet potato', 'sweet-potato', 'beetroot', 'capsicum',
        'mango', 'macadamia', 'macademia', 'melon', 'pear', 'jackfruit',
        'cowpea', 'nightshade', 'cotton', 'flax', 'apple', 'cocoa',
        'pineapple', 'garlic', 'sunflower', 'cereal', 'orange', 'miraa',
        'carrot', 'guava', 'tea', 'safflower', 'napier', 'napier-grass',
        'peanut', 'collard', 'collard-greens', 'french bean', 'french-bean',
        'lettuce', 'aubergine', 'eggplant', 'yam', 'oat', 'soya', 'soybean',
        'mung bean', 'mung-bean', 'clover', 'strawberry', 'pea', 'rapeseed',
        'radish', 'taro', 'cucumber', 'eucalyptus', 'chilli', 'chili',
        'chard', 'mushroom', 'broccoli', 'pyrethrum', 'pigeon pea', 'pigeon-pea',
        'barley', 'cashew', 'cashew-nut', 'ginger', 'sesame', 'grape',
        'lemon', 'desmodium', 'finger millet', 'finger-millet', 'bamboo',
        'okra', 'chia', 'courgette', 'zucchini', 'celery', 'sudan grass',
        'sudan-grass', 'cauliflower', 'lucern', 'lucerne', 'boma rhodes',
        'boma-rhodes', 'castor', 'castor-bean', 'squash', 'black nightshade',
        'black-nightshade', 'brachiaria', 'brachiaria-grass', 'caliandra',
        'corriander', 'coriander', 'sisal', 'amaranth', 'peach', 'leucaena',
        'gooseberry', 'lupin', 'leek', 'acacia', 'apricot', 'parsley',
        'chickpea', 'snap pea', 'snap-pea', 'blackberry', 'mulberry',
        'asparagus', 'african nightshade', 'african-nightshade', 'snow pea',
        'snow-pea', 'vetch', 'rye', 'setaria', 'purple vetch', 'purple-vetch',
        'cranberry', 'locust bean', 'tree', 'crop', 'grass'
    ],
    
    'pests_diseases': [
        'pest', 'disease', 'insect', 'fungus', 'worm', 'infected', 'spray',
        'infestation', 'aphid', 'caterpillar', 'larva', 'maggot', 'borer',
        'weevil', 'mite', 'nematode', 'whitefly', 'thrip', 'locust',
        'leaf miner', 'stem borer', 'fruit fly', 'blight', 'mildew',
        'rust', 'rot', 'wilt', 'mosaic', 'virus', 'bacterial', 'fungal',
        'control', 'manage', 'treat', 'treatment', 'prevent', 'prevention',
        'pesticide', 'insecticide', 'fungicide', 'herbicide', 'biocontrol',
        'chemical', 'spraying', 'dose', 'application', 
        'bean fly', 'fly',
        'flower', 'flowering', 'bud', 'budding', 'pre-flower',
        'almost to flower', 'before flower', 'during flower',
        'chemical', 'what chemical', 'which chemical', 'apply',
        'dawa', 'medicine', 'drug', 'treatment for',
        'dewarm', 'deworm', 'deworming', 'worming',
        'vaccine', 'vaccination', 'injection', 'inject',
        'rabbit', 'pig', 'poultry', 'chicken', 'turkey', 'sheep', 'goat',
        'cattle', 'dog', 'cat', 'bird', 'duck', 'fish', 'tilapia',
        'guinea pig', 'guinea-pig', 'guinea fowl', 'guinea-fowl',
        'ostrich', 'camel', 'cyprus'
    ],
    
    'harvesting': [
        'harvest', 'ripe', 'mature', 'yield', 'pick', 'collect', 'harvesting',
        'reap', 'reaping', 'pluck', 'gather', 'cut', 'harvest time',
        'harvest date', 'harvest stage', 'maturity', 'maturation', 'ready harvest',
        'harvest indicator', 'harvest method', 'hand pick', 'machine harvest',
        'harvest loss', 'post-harvest', 'storage', 'store', 'cure', 'curing',
        'dry', 'drying', 'sort', 'grade', 'grading', 'pack', 'packing',
        'thresh', 'threshing', 'shell', 'shelling', 'husk', 'husking',
        'mill', 'milling', 'process', 'processing', 'preserve', 'preservation',
        'harvest yield', 'production', 'productivity', 'output',
        'how to harvest', 'when to harvest', 'best time to harvest'
    ],
    
    'soil_fertility': [
        'soil', 'fertilizer', 'manure', 'compost', 'nutrient', 'ph',
        'fertility', 'soil health', 'soil test', 'soil analysis', 'soil type',
        'soil texture', 'soil structure', 'loam', 'clay', 'sandy', 'silt',
        'organic matter', 'humus', 'mulch', 'mulching', 'cover crop',
        'green manure', 'biochar', 'charcoal', 'ash', 'lime', 'liming',
        'gypsum', 'rock phosphate', 'urea', 'npk', 'dap', 'can',
        'micronutrient', 'macronutrient', 'nitrogen', 'phosphorus', 'potassium',
        'calcium', 'magnesium', 'sulfur', 'iron', 'zinc', 'manganese', 'boron',
        'deficiency', 'excess', 'toxicity', 'salinity', 'saline', 'alkaline',
        'acidic', 'acidity', 'alkalinity', 'amend', 'amendment', 'improve soil',
        'restore soil', 'rehabilitate', 'conservation', 'erosion', 'erode',
        'topsoil', 'subsoil', 'plow layer', 'hardpan', 'compaction', 'compact',
        'drainage', 'waterlog', 'waterlogged', 'aeration', 'aerobe', 'anaerobic',
        'apply', 'application', 'when to apply', 'how to apply',
        'during flowering', 'before flowering', 'after flowering'
    ],
    
    'water_management': [
        'water', 'irrigate', 'rain', 'drought', 'irrigation', 'watering',
        'rainfall', 'rainy', 'dry spell', 'dry season', 'wet season',
        'flood', 'flooding', 'waterlog', 'waterlogged', 'drain', 'drainage',
        'drip', 'sprinkler', 'furrow', 'basin', 'flood irrigation',
        'overhead', 'canal', 'channel', 'pipe', 'pump', 'pumping',
        'borehole', 'well', 'river', 'stream', 'dam', 'reservoir', 'pond',
        'tank', 'storage tank', 'rainwater', 'harvest water', 'catchment',
        'runoff', 'infiltration', 'percolation', 'evaporation', 'evapotranspiration',
        'water stress', 'water scarcity', 'water requirement', 'crop water',
        'irrigation schedule', 'irrigation time', 'watering frequency',
        'moisture', 'soil moisture', 'wilting', 'wilt', 'drought tolerant',
        'drought resistant', 'water efficient', 'water saving', 'conservation',
        'how much water', 'when to water', 'frequency of watering'
    ],
    
    'market_info': [
        'price', 'market', 'sell', 'buy', 'cost', 'profit', 'market price',
        'selling price', 'buying price', 'farm gate', 'wholesale', 'retail',
        'export', 'import', 'trader', 'broker', 'middleman', 'agent',
        'auction', 'contract', 'contract farming', 'forward sale',
        'value addition', 'processing', 'packaging', 'labeling', 'branding',
        'certification', 'organic cert', 'fairtrade', 'quality standard',
        'grade', 'grading', 'sorting', 'storage cost', 'transport', 'logistics',
        'market access', 'market information', 'price trend', 'seasonal price',
        'high price', 'low price', 'price fluctuation', 'profit margin',
        'loss', 'break even', 'subsidy', 'loan', 'credit', 'finance',
        'insurance', 'crop insurance', 'record keeping', 'accounting',
        'cost benefit', 'return investment', 'roi', 'economy', 'economic',
        'prize', 'prize of', 'cost of', 'how much', 'what price',
        'kgs', 'kilogram', 'kg', 'per kg', 'per kilogram',
        '50kgs', 'per 50kg', 'bag', 'sack', '90kg bag'
    ],
    
    'animal_care': [
        'feed', 'breed', 'vaccine', 'livestock', 'milk', 'egg', 'animal',
        'cattle', 'cow', 'goat', 'sheep', 'pig', 'poultry', 'chicken',
        'rabbit', 'fish', 'bee', 'apiary', 'honey', 'dairy', 'beef',
        'meat', 'wool', 'hide', 'skin', 'fodder', 'forage', 'pasture',
        'grazing', 'zero graze', 'stall feed', 'concentrate', 'supplement',
        'mineral', 'salt lick', 'water trough', 'shelter', 'housing',
        'pen', 'coop', 'stable', 'byre', 'vaccination', 'deworm',
        'deworming', 'tick control', 'mastitis', 'brucellosis', 'foot rot',
        'newcastle', 'gumboro', 'african swine', 'east coast fever',
        'breeding', 'artificial insemination', 'ai', 'heat detection',
        'pregnancy', 'calving', 'kidding', 'lambing', 'farrowing',
        'hatching', 'incubate', 'incubation', 'wean', 'weaning',
        'growth rate', 'weight gain', 'milk production', 'egg production',
        'manure management', 'slurry', 'biogas', 'draught power',
        'hen', 'layer', 'laying', 'egg laying', 'bad layer', 'good layer',
        'egg production', 'lay eggs', 'egg yield', 'egg output',
        'laying hen', 'layer hen', 'productive layer', 'poor layer',
        'how to know', 'identify layer', 'recognize layer',
        'rooster', 'cock', 'chick', 'chicks', 'broiler', 'layer breed',
        'laying cycle', 'laying period', 'laying capacity',
        'turkey', 'duck', 'bird', 'pigeon', 'dog', 'cat', 'guinea pig',
        'guinea-pig', 'guinea fowl', 'guinea-fowl', 'ostrich', 'camel',
        'tilapia', 'fish', 'bee',  
        'best type', 'best breed', 'which breed', 'good breed',
        'recommended breed', 'suitable breed', 'what breed',
        'choose breed', 'select breed', 'selection of',
        'calf', 'calves', 'young animal', 'baby animal',
        'eats', 'eating', 'consumed', 'swallowed', 'ingested',
        'puppy', 'puppies', 'pet', 'pets', 'small dog', 'young dog',
        'puppy deworming', 'action', 'what action', 'what to do',
        'how to treat', 'sick', 'illness', 'health problem', 'medical',
        'care', 'management', 'husbandry', 'rearing'
    ]
}

    for category, keywords in categories.items():
        if any(keyword in question_text for keyword in keywords):
            return category

    return 'other'

In [42]:
q_df['question_type'] = q_df['clean_text'].apply(define_question_types)
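
One caveat of the plain `in` test used above is that it matches substrings, so a short keyword such as `ai` also matches `rain`, and `ph` matches `phone`. A possible alternative, sketched here with a deliberately tiny, hypothetical keyword dictionary (`toy_categories`) instead of the full one, is to compile one regular expression per category with word boundaries. The function name `question_type_by_word` is also an assumption made for this example:

```python
import re

# Hypothetical, shortened keyword lists for illustration only
toy_categories = {
    "water_management": ["rain", "irrigation", "dry season"],
    "animal_care": ["ai", "cow", "feed"],
}

# One compiled pattern per category; \b stops 'ai' from matching inside 'rain'
patterns = {
    category: re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")
    for category, keywords in toy_categories.items()
}

def question_type_by_word(question_text):
    """Return the first category with a whole-word keyword match, else 'other'."""
    for category, pattern in patterns.items():
        if pattern.search(question_text):
            return category
    return "other"

print(question_type_by_word("when will the rain start"))
print(question_type_by_word("how much does ai cost for a cow"))
print(question_type_by_word("where can i buy a phone"))
```

The trade-off is that whole-word matching no longer catches inflected forms (for example, `cows` does not match `cow`), so plurals would need to be listed explicitly or handled with stemming.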

## Rename and reorder columns

In [22]:
q_df = q_df.rename(columns={
    'question_user_id': 'user_id',
    'question_content': 'text',
    'question_sent': 'date',
    'question_user_country_code': 'country',
    'question_topic': 'topics',
})

In [23]:
# Keep and order the columns used downstream (question_type is not carried forward)
q_df = q_df[['question_id', 'user_id', 'country', 'topics', 'text', 'clean_text', 'date']]

## Add farming season column

Since we're analysing seasonal trends in the dataset, I will add a column indicating the farming seasons as described in the [Project Brief](https://docs.google.com/document/d/1jKTmb8R5GlM9uqQkB5fXd37o2bdX17JKB36mK-NqWFE/edit?tab=t.0).

In [27]:
q_df['date'] = pd.to_datetime(q_df['date'], format='mixed', errors='coerce')
q_df['month'] = q_df['date'].dt.month

In [30]:
# KENYA
# Long rains from March to May
q_df.loc[
    (q_df['country'] == 'ke') &
    (q_df['month'].isin(range(3, 6))),
    'season'
] = 'long_rains'

# Short rains from October to December
q_df.loc[
    (q_df['country'] == 'ke') &
    (q_df['month'].isin(range(10, 13))),
    'season'
] = 'short_rains'

# Harvest periods from June to August and January to February
q_df.loc[
    (q_df['country'] == 'ke') &
    (q_df['month'].isin(range(6, 9))),
    'season'
] = 'harvesting'

q_df.loc[
    (q_df['country'] == 'ke') &
    (q_df['month'].isin(range(1, 3))),
    'season'
] = 'harvesting'

# September out of season
q_df.loc[
    (q_df['country'] == 'ke') &
    (q_df['month'] == 9),
    'season'
] = 'no_season'

In [32]:
# UGANDA
# Planting March-May (Season A) and September-November (Season B)
q_df.loc[
    (q_df['country'] == 'ug') &
    (q_df['month'].isin(range(3, 6))),
    'season'
] = 'planting'

q_df.loc[
    (q_df['country'] == 'ug') &
    (q_df['month'].isin(range(9, 12))),
    'season'
] = 'planting'

# Harvesting June-August (Season A) and December-February (Season B)
q_df.loc[
    (q_df['country'] == 'ug') &
    (q_df['month'].isin(range(6, 9))),
    'season'
] = 'harvesting'

q_df.loc[
    (q_df['country'] == 'ug') &
    (q_df['month'].isin([12, 1, 2])),
    'season'
] = 'harvesting'

In [41]:
# TANZANIA
# Masika rains March-May
q_df.loc[
    (q_df['country'] == 'tz') &
    (q_df['month'].isin(range(3, 6))),
    'season'
] = 'masika_rains'

# Vuli rains October-December
q_df.loc[
    (q_df['country'] == 'tz') &
    (q_df['month'].isin(range(10, 13))),
    'season'
] = 'vuli_rains'

# Harvesting June-August
q_df.loc[
    (q_df['country'] == 'tz') &
    (q_df['month'].isin(range(6, 9))),
    'season'
] = 'harvesting'

# January, February, September out of season
q_df.loc[
    (q_df['country'] == 'tz') &
    (q_df['month'].isin([1, 2, 9])),
    'season'
] = 'no_season'
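
The per-country rules above can also be written as a single lookup table, which keeps the month ranges in one place and makes overlaps easier to spot. This is only a sketch of an alternative, not the code used to produce the results in this notebook; the names `SEASONS`, `SEASON_BY_MONTH` and `season_for` are made up for the example:

```python
# Month lists per country and season, mirroring the comments in the cells above
SEASONS = {
    "ke": {
        "long_rains": [3, 4, 5],
        "short_rains": [10, 11, 12],
        "harvesting": [6, 7, 8, 1, 2],
        "no_season": [9],
    },
    "ug": {
        "planting": [3, 4, 5, 9, 10, 11],
        "harvesting": [6, 7, 8, 12, 1, 2],
    },
    "tz": {
        "masika_rains": [3, 4, 5],
        "vuli_rains": [10, 11, 12],
        "harvesting": [6, 7, 8],
        "no_season": [1, 2, 9],
    },
}

# Flatten to {(country, month): season} for direct lookup
SEASON_BY_MONTH = {
    (country, month): season
    for country, seasons in SEASONS.items()
    for season, months in seasons.items()
    for month in months
}

def season_for(country, month):
    """Return the season label, or None for countries without rules (e.g. 'gb')."""
    return SEASON_BY_MONTH.get((country, month))

print(season_for("ke", 9))   # no_season
print(season_for("ug", 12))  # harvesting
print(season_for("gb", 5))   # None
```

Applied to the dataframe, this could look like `q_df['season'] = [season_for(c, m) for c, m in zip(q_df['country'], q_df['month'])]`, which leaves countries without rules as missing values.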

After creating the column, let's check whether there are any null values.

In [44]:
q_df['season'].isna().sum()

61

In [43]:
q_df[q_df['season'].isna()]['country'].unique()

array(['gb'], dtype=object)

Only GB has null values in the season column. Since we're only analysing African countries, that's expected. \
Let's drop the month column as we don't need it anymore.

In [46]:
q_df.drop(columns=['month'], inplace=True)

## Filter countries and create dataframes per country

In [47]:
q_df['country'].value_counts()

country
ke    1290839
ug     927642
gb         61
tz          2
Name: count, dtype: int64

As we can see, Tanzania has barely any rows left after cleaning. This is due to the language filter: almost all questions from Tanzania in the original dataset are tagged `swa` rather than `eng`:

In [48]:
print(f"""Tanzania language value counts in the original dataset:
{df[df['question_user_country_code']=='tz']['question_language'].value_counts()}""")

Tanzania language value counts in the original dataset:
question_language
swa    4233714
eng         12
Name: count, dtype: int64


I will keep only Kenya and Uganda.

In [49]:
q_df = q_df[q_df['country'].isin(['ke', 'ug'])]

In [50]:
q_df_ke = q_df[q_df['country']=='ke']
q_df_ug = q_df[q_df['country']=='ug']

The final shape of the dataframes:

In [51]:
q_df.shape

(2218481, 8)

In [52]:
q_df_ke.shape

(1290839, 8)

In [53]:
q_df_ug.shape

(927642, 8)

## Export dataframes

In [54]:
# Dataframe with all questions in English from Kenya and Uganda together
q_df.to_csv("data/clean/EN_questions.csv", index=False)

In [55]:
# Dataframe with all questions in English from Kenya
q_df_ke.to_csv("data/clean/ke_EN_questions.csv", index=False)

In [56]:
# Dataframe with all questions in English from Uganda
q_df_ug.to_csv("data/clean/ug_EN_questions.csv", index=False)
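
One thing to keep in mind when loading these files later: CSV has no tuple type, so the `topics` column is written as text such as `('pig', 'coconut')` and comes back as a plain string. A minimal sketch of restoring the tuples with `ast.literal_eval`, using a toy dataframe and an in-memory buffer instead of the real files:

```python
import ast
import io

import pandas as pd

# Toy stand-in for the exported dataframe
toy = pd.DataFrame({
    "question_id": [3849084, 3849100],
    "topics": [("rabbit",), ("pig", "coconut")],
})

# Write to an in-memory CSV, as to_csv(..., index=False) does for the real files
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without a converter the column would hold strings; literal_eval rebuilds the tuples
restored = pd.read_csv(buffer, converters={"topics": ast.literal_eval})
print(restored["topics"].tolist())
```

The same `converters` argument should work when reading `data/clean/EN_questions.csv` and the two per-country files.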