# Association Rule Mining on Cassava_Yield_Data

This notebook uses the Apriori algorithm to look for associations involving fertiliser treatments across seasons. Steps:
1. Data understanding: inspect dataset and columns.
2. Preprocessing: convert data to transaction lists per plot/season.
3. Use TransactionEncoder and Apriori to find frequent itemsets.
4. Generate association rules and interpret.

The preprocessing code tries several possible dataset layouts in turn and falls back to a row-by-row heuristic when it does not recognise fertiliser, season and plot columns by name.

## 1) Data Understanding

In [1]:
import pandas as pd
cassava = pd.read_excel(r'Cassava_Yield_Data.xlsx')
print('Shape:', cassava.shape)
print('\nColumns and dtypes:\n', cassava.dtypes)
cassava.head()

Shape: (115, 20)

Columns and dtypes:
 Sesn                       int64
locn                       int64
block                      int64
rep                        int64
tillage                   object
ferT                      object
Plants_harvested           int64
No_bigtubers               int64
Weigh_bigtubers          float64
No_mediumtubers            int64
Weight_mediumtubers      float64
No_smalltubers             int64
Weight_smalltubers       float64
Totaltuberno               int64
AV_tubers_Plant          float64
Total_tubweight          float64
plotsize                 float64
HEC                        int64
TotalWeightperhectare    float64
TotalTuberperHectare     float64
dtype: object


| | Sesn | locn | block | rep | tillage | ferT | Plants_harvested | No_bigtubers | Weigh_bigtubers | No_mediumtubers | Weight_mediumtubers | No_smalltubers | Weight_smalltubers | Totaltuberno | AV_tubers_Plant | Total_tubweight | plotsize | HEC | TotalWeightperhectare | TotalTuberperHectare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 1 | conv | F2150 | 28 | 0 | 0.0 | 61 | 2.5 | 319 | 4.7 | 380 | 13.571429 | 7.2 | 5.3 | 10000 | 13584.90566 | 716981.132075 |
| 1 | 2 | 1 | 1 | 1 | conv | F1100 | 28 | 0 | 0.0 | 110 | 4.6 | 260 | 4.0 | 370 | 13.214286 | 8.6 | 5.3 | 10000 | 16226.415094 | 698113.207547 |
| 2 | 2 | 1 | 1 | 1 | conv | F3200 | 28 | 2 | 0.2 | 115 | 5.2 | 319 | 4.4 | 436 | 15.571429 | 9.8 | 5.3 | 10000 | 18490.566038 | 822641.509434 |
| 3 | 2 | 1 | 1 | 1 | conv | F5300 | 28 | 6 | 0.7 | 60 | 2.7 | 303 | 4.8 | 369 | 13.178571 | 8.2 | 5.3 | 10000 | 15471.698113 | 696226.415094 |
| 4 | 2 | 1 | 1 | 1 | conv | F4250 | 28 | 3 | 0.3 | 82 | 3.4 | 332 | 4.7 | 417 | 14.892857 | 8.4 | 5.3 | 10000 | 15849.056604 | 786792.45283 |


## 2) Preprocessing: construct transactions

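The code tries several strategies depending on the file structure:
- If there is a fertiliser column (any column whose name contains 'fert') together with season and plot/ID columns, it groups by season and plot to create transactions.
- If there is a fertiliser column and a season column but no plot column, it groups by season alone.
- Otherwise it falls back to treating each row as one transaction, built from 0/1 indicator columns and short string values.

In this dataset the season column is named `Sesn`, which does not contain 'season', so the first two branches are skipped and the row-by-row fallback runs.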

In [2]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# `cassava` was already loaded in cell [1], so the file is not read again here.

# Heuristics to build transactions
transactions = []

# Case A: dedicated fertiliser column (one fertiliser per row); look for season and plot columns to group by.
# Matching is by substring, so it is loose: 'fert' matches `ferT`, 'plot' also matches `plotsize`,
# and 'season' does not match `Sesn`.
fert_cols = [c for c in cassava.columns if 'fert' in c.lower()]
season_cols = [c for c in cassava.columns if 'season' in c.lower()]
plot_cols = [c for c in cassava.columns if any(x in c.lower() for x in ['plot','id','field','farm'])]

if len(fert_cols)>0 and len(season_cols)>0 and len(plot_cols)>0:
    # Group by season + plot and aggregate fertilizer names
    season_col = season_cols[0]
    plot_col = plot_cols[0]
    fert_col = fert_cols[0]
    grouped = cassava.groupby([season_col, plot_col])[fert_col].apply(lambda x: list(x.dropna().astype(str).unique())).reset_index()
    transactions = grouped[fert_col].tolist()
elif len(fert_cols)>0 and len(season_cols)>0:
    # No plot column but we can group by season alone
    season_col = season_cols[0]
    fert_col = fert_cols[0]
    grouped = cassava.groupby(season_col)[fert_col].apply(lambda x: list(x.dropna().astype(str).unique())).reset_index()
    transactions = grouped[fert_col].tolist()
else:
    # Fallback: build one transaction per row from binary indicator columns (0/1 or True/False)
    # and from string cells. The column name is added when a numeric value equals 1, so integer-coded
    # design columns such as `locn`, `block` and `rep` also become items whenever they equal 1.
    candidate_fert_cols = cassava.columns.tolist()
    for idx, row in cassava.iterrows():
        row_items = []
        for c in candidate_fert_cols:
            val = row[c]
            if pd.isna(val):
                continue
            # If cell is boolean-like or 0/1 and equals 1/True -> include column name
            if isinstance(val, (int, float)) and val in [0,1] and c.lower().strip() not in ['season','year']:
                if val==1:
                    row_items.append(c)
            elif isinstance(val, str):
                # If cell contains comma-separated values, split them
                if ',' in val:
                    pieces = [p.strip() for p in val.split(',') if p.strip()!='']
                    row_items.extend(pieces)
                else:
                    # Heuristic: include the string if the column or value mentions 'fert', or if it has
                    # at most three words (so short labels such as the tillage level 'conv' are included too)
                    if 'fert' in c.lower() or 'fert' in val.lower() or len(val.split())<=3:
                        row_items.append(val.strip())
        if row_items:
            transactions.append(list(set(row_items)))  # unique items per transaction

print('Number of transactions constructed:', len(transactions))
transactions[:10]

Number of transactions constructed: 115




[['F2150', 'locn', 'block', 'conv', 'rep'],
 ['locn', 'block', 'F1100', 'conv', 'rep'],
 ['locn', 'F3200', 'block', 'conv', 'rep'],
 ['locn', 'F5300', 'block', 'conv', 'rep'],
 ['F4250', 'locn', 'block', 'conv', 'rep'],
 ['conv', 'F5300', 'locn'],
 ['conv', 'F3200', 'locn'],
 ['conv', 'F4250', 'locn'],
 ['conv', 'F1100', 'No_bigtubers', 'locn'],
 ['conv', 'F2150', 'No_bigtubers', 'locn']]
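
The transactions printed above mix fertiliser codes with column names (`locn`, `block`, `rep`, `No_bigtubers`), because the fallback branch appends the name of any numeric column whose value happens to be 1. A more explicit alternative is to choose the categorical columns by name and prefix each value with its column. The sketch below shows one way this could look, using the five rows from `cassava.head()`. The helper `row_to_transaction`, the `ITEM_COLUMNS` list and the `column=value` item format are illustrative assumptions, not part of the original notebook.

```python
# Five rows copied from the cassava.head() output above (selected columns only).
rows = [
    {"Sesn": 2, "tillage": "conv", "ferT": "F2150"},
    {"Sesn": 2, "tillage": "conv", "ferT": "F1100"},
    {"Sesn": 2, "tillage": "conv", "ferT": "F3200"},
    {"Sesn": 2, "tillage": "conv", "ferT": "F5300"},
    {"Sesn": 2, "tillage": "conv", "ferT": "F4250"},
]

# Columns treated as categorical items (an assumption about what is of interest).
ITEM_COLUMNS = ["Sesn", "tillage", "ferT"]


def row_to_transaction(row, columns=ITEM_COLUMNS):
    """Hypothetical helper: turn one row into sorted 'column=value' items, skipping None."""
    return sorted(f"{col}={row[col]}" for col in columns if row.get(col) is not None)


transactions_explicit = [row_to_transaction(r) for r in rows]
for t in transactions_explicit:
    print(t)
```

With the full DataFrame, the row dictionaries could be obtained with `cassava[ITEM_COLUMNS].to_dict('records')`. Because each row holds a single `ferT` value, transactions built this way can relate a fertiliser code to a season or tillage level, but never one fertiliser to another.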

## 3) Apriori and association rules

We use TransactionEncoder to create the one-hot encoded matrix, then run Apriori and extract rules with `association_rules`. We will show frequent itemsets and the top rules sorted by lift.
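
For intuition about what the one-hot matrix and `min_support` represent, here is a small standard-library sketch applied to the first three transactions printed by cell [2]. It enumerates itemsets by brute force, which is not how Apriori prunes candidates, and the names `sample`, `support` and `frequent` are introduced only for this illustration.

```python
from itertools import combinations

# First three transactions from the output of cell [2].
sample = [
    ["F2150", "locn", "block", "conv", "rep"],
    ["locn", "block", "F1100", "conv", "rep"],
    ["locn", "F3200", "block", "conv", "rep"],
]

# One boolean column per distinct item, one row per transaction.
items = sorted({item for t in sample for item in t})
one_hot = [[item in t for item in items] for t in sample]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


# Brute-force enumeration of 1- and 2-itemsets that reach the threshold.
min_support = 0.5
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        s = support(combo, sample)
        if s >= min_support:
            frequent[combo] = s

print(items)
print(frequent)
```

Each fertiliser code appears in only one of the three transactions, so none of them reaches the threshold, while the design and tillage items that appear in every row do.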

In [3]:
# If no transactions were created, print a note instead of running Apriori
if len(transactions)==0:
    print('No transactions could be constructed automatically. Inspect the dataset structure and adjust preprocessing accordingly.')
else:
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    df_te = pd.DataFrame(te_ary, columns=te.columns_)
    print('One-hot shape:', df_te.shape)
    # Frequent itemsets
    freq_itemsets = apriori(df_te, min_support=0.05, use_colnames=True)
    freq_itemsets = freq_itemsets.sort_values('support', ascending=False).reset_index(drop=True)
    print('\nFrequent itemsets (top 20):')
    display(freq_itemsets.head(20))
    # Association rules
    rules = association_rules(freq_itemsets, metric='lift', min_threshold=1.0)
    rules = rules.sort_values(['lift','confidence'], ascending=False).reset_index(drop=True)
    print('\nTop association rules (by lift):')
    display(rules.head(20))

One-hot shape: (115, 16)

Frequent itemsets (top 20):


| | support | itemsets |
|---|---|---|
| 0 | 0.521739 | (conv) |
| 1 | 0.478261 | (minimum) |
| 2 | 0.478261 | (Sesn) |
| 3 | 0.478261 | (locn) |
| 4 | 0.304348 | (block, rep) |
| 5 | 0.304348 | (rep) |
| 6 | 0.304348 | (block) |
| 7 | 0.26087 | (Sesn, conv) |
| 8 | 0.26087 | (locn, conv) |
| 9 | 0.217391 | (minimum, locn) |



Top association rules (by lift):


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (block, conv) | (Sesn, rep) | 0.173913 | 0.130435 | 0.086957 | 0.5 | 3.833333 | 1.0 | 0.064272 | 1.73913 | 0.894737 | 0.4 | 0.425 | 0.583333 |
| 1 | (rep, conv) | (block, Sesn) | 0.173913 | 0.130435 | 0.086957 | 0.5 | 3.833333 | 1.0 | 0.064272 | 1.73913 | 0.894737 | 0.4 | 0.425 | 0.583333 |
| 2 | (block, conv) | (rep, locn) | 0.173913 | 0.130435 | 0.086957 | 0.5 | 3.833333 | 1.0 | 0.064272 | 1.73913 | 0.894737 | 0.4 | 0.425 | 0.583333 |
| 3 | (rep, conv) | (block, locn) | 0.173913 | 0.130435 | 0.086957 | 0.5 | 3.833333 | 1.0 | 0.064272 | 1.73913 | 0.894737 | 0.4 | 0.425 | 0.583333 |
| 4 | (block, Sesn) | (rep, conv) | 0.130435 | 0.173913 | 0.086957 | 0.666667 | 3.833333 | 1.0 | 0.064272 | 2.478261 | 0.85 | 0.4 | 0.596491 | 0.583333 |
| 5 | (Sesn, rep) | (block, conv) | 0.130435 | 0.173913 | 0.086957 | 0.666667 | 3.833333 | 1.0 | 0.064272 | 2.478261 | 0.85 | 0.4 | 0.596491 | 0.583333 |
| 6 | (block, locn) | (rep, conv) | 0.130435 | 0.173913 | 0.086957 | 0.666667 | 3.833333 | 1.0 | 0.064272 | 2.478261 | 0.85 | 0.4 | 0.596491 | 0.583333 |
| 7 | (rep, locn) | (block, conv) | 0.130435 | 0.173913 | 0.086957 | 0.666667 | 3.833333 | 1.0 | 0.064272 | 2.478261 | 0.85 | 0.4 | 0.596491 | 0.583333 |
| 8 | (block) | (rep) | 0.304348 | 0.304348 | 0.304348 | 1.0 | 3.285714 | 1.0 | 0.21172 | inf | 1.0 | 1.0 | 1.0 | 1.0 |
| 9 | (rep) | (block) | 0.304348 | 0.304348 | 0.304348 | 1.0 | 3.285714 | 1.0 | 0.21172 | inf | 1.0 | 1.0 | 1.0 | 1.0 |


## 4) Interpretation

This section covers:
- how to read a rule: antecedents → consequents, support, confidence, lift;
- whether the mined rules say anything about fertilisers co-occurring in specific seasons;
- next steps: try different min_support/min_threshold values, analyse each season separately, or include yield values to check the effect of combinations.

**Antecedents → Consequents**


Antecedents are the items on the left-hand side (the “if” part).

Consequents are the items on the right-hand side (the “then” part).

Support measures how frequently both antecedents and consequents occur together in the dataset.

Confidence tells us the probability that the consequents occur given the antecedents.

Lift indicates how much more likely the consequent is to occur together with the antecedent than would be expected if the two were independent. A lift > 1 shows a positive association, a lift of 1 means no association, and a lift < 1 indicates a negative association.
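
Written out, with $\mathrm{supp}(X)$ denoting the fraction of transactions that contain every item in $X$, the three measures take the following standard form. The symbols $A$ (antecedent), $C$ (consequent), $T$ (the set of transactions) and $N$ (their number) are notation introduced here for illustration.

$$
\begin{aligned}
\mathrm{supp}(X) &= \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{N} \\
\mathrm{support}(A \rightarrow C) &= \mathrm{supp}(A \cup C) \\
\mathrm{confidence}(A \rightarrow C) &= \frac{\mathrm{supp}(A \cup C)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \rightarrow C) &= \frac{\mathrm{confidence}(A \rightarrow C)}{\mathrm{supp}(C)} = \frac{\mathrm{supp}(A \cup C)}{\mathrm{supp}(A)\,\mathrm{supp}(C)}
\end{aligned}
$$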

**Example from the dataset:**


Rule: (block) → (rep)

Support = 0.30 → 30% of transactions contain both block and rep.

Confidence = 1.0 → whenever block appears, rep always appears.

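Lift = 3.29 → rep is about 3.3 times more likely to appear in a transaction that contains block than in a transaction picked at random (confidence 1.0 divided by the support of rep, 0.304).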
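
The figures in this example can be rechecked by hand from the supports in the rules table. The sketch below does so for (block) → (rep) and for (block, conv) → (Sesn, rep). The row counts 35, 20, 15 and 10 are inferred here from the printed supports and the 115 transactions (for instance 0.304348 × 115 ≈ 35), and `rule_metrics` is a hypothetical helper, not an mlxtend function.

```python
N_ROWS = 115  # number of transactions reported by cell [2]


def rule_metrics(supp_a, supp_c, supp_ac):
    """Return (confidence, lift, leverage) for a rule A -> C from three supports."""
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    return confidence, lift, leverage


# (block) -> (rep): all three supports are 0.304348, i.e. 35 of 115 rows (inferred).
conf_1, lift_1, lev_1 = rule_metrics(35 / N_ROWS, 35 / N_ROWS, 35 / N_ROWS)

# (block, conv) -> (Sesn, rep): supports 0.173913, 0.130435 and 0.086957,
# i.e. 20, 15 and 10 of 115 rows (inferred).
conf_2, lift_2, lev_2 = rule_metrics(20 / N_ROWS, 15 / N_ROWS, 10 / N_ROWS)

print(round(conf_1, 6), round(lift_1, 6), round(lev_1, 6))
print(round(conf_2, 6), round(lift_2, 6), round(lev_2, 6))
```

Both sets of values agree with the `confidence`, `lift` and `leverage` columns reported by `association_rules` above.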


**Observations on fertilizer-season relationships**

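None of the itemsets and rules shown above contains a fertiliser code. Their items are `conv` and `minimum` (levels of `tillage`) together with `Sesn`, `locn`, `block` and `rep`, which are column names that the fallback heuristic added to a transaction whenever that column's value was 1. The rules therefore describe how the experiment was laid out and coded, not how fertilisers behave.

For example, (block) → (rep) has confidence 1.0 because `block` and `rep` have the same support (0.304) on their own and together, so they always appear in the same transactions. Rules such as (block, conv) → (Sesn, rep), with lift 3.83, combine the same design columns with a tillage level.

Each row records a single `ferT` value, so two fertilisers can never appear in the same row-level transaction. Whether particular fertilisers are tied to particular seasons cannot be read from these results; answering that requires transactions built from explicitly chosen columns.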


**Next steps**

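Adjust thresholds: Experiment with different `min_support` values in `apriori` and with different `metric` and `min_threshold` settings in `association_rules` (for example `metric='confidence'`) to discover more nuanced or less frequent associations.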

Season-specific analysis: Analyze data separately for each season to uncover rules that may only appear under certain seasonal conditions.
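
A season-specific analysis could start by splitting the transactions on `Sesn` before mining. The sketch below shows only the mechanics. The rows in `toy_rows` are made up for illustration and are NOT taken from `Cassava_Yield_Data.xlsx`, and the `column=value` item format is an assumption.

```python
from collections import defaultdict

# Toy rows for illustration only; they are NOT taken from Cassava_Yield_Data.xlsx.
toy_rows = [
    {"Sesn": 1, "tillage": "conv", "ferT": "F1100"},
    {"Sesn": 1, "tillage": "minimum", "ferT": "F1100"},
    {"Sesn": 2, "tillage": "conv", "ferT": "F2150"},
    {"Sesn": 2, "tillage": "conv", "ferT": "F3200"},
]

# One list of transactions (sets of items) per season.
by_season = defaultdict(list)
for row in toy_rows:
    by_season[row["Sesn"]].append({f"tillage={row['tillage']}", f"ferT={row['ferT']}"})


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


for season, txns in sorted(by_season.items()):
    print(season, support({"tillage=conv"}, txns))
```

Each per-season list could then go through the same `TransactionEncoder`, `apriori` and `association_rules` steps as in cell [3].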

Include yield data: Combine association rules with crop yield values to evaluate which fertilizer combinations are not only common but also effective in improving yields.
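
As a minimal sketch of bringing yield into the picture, the snippet below averages `TotalWeightperhectare` per `ferT` code for the five rows shown by `cassava.head()`. Five rows from a single season and tillage level cannot support any agronomic conclusion; the point is only the grouping pattern, and the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (ferT, TotalWeightperhectare) pairs copied from the cassava.head() output above.
head_rows = [
    ("F2150", 13584.90566),
    ("F1100", 16226.415094),
    ("F3200", 18490.566038),
    ("F5300", 15471.698113),
    ("F4250", 15849.056604),
]

# Collect the yields observed for each fertiliser code, then average them.
yields = defaultdict(list)
for fert, weight in head_rows:
    yields[fert].append(weight)
mean_yield = {fert: mean(values) for fert, values in yields.items()}

# Fertiliser codes ordered from highest to lowest mean yield in this tiny sample.
ranked = sorted(mean_yield, key=mean_yield.get, reverse=True)
print(ranked)
```

On the full DataFrame the same idea could be written as `cassava.groupby(['tillage', 'ferT'])['TotalWeightperhectare'].mean()`.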

Visualizations: Consider plotting a network diagram of fertilizers and their co-occurrences for more intuitive insights.
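
A network diagram needs an edge list with weights. One possible way to derive it is to count how often each pair of items shares a transaction, shown here for the ten transactions printed by cell [2]. The names `shown` and `edge_weights` are introduced for this sketch only.

```python
from collections import Counter
from itertools import combinations

# The ten transactions shown in the output of cell [2].
shown = [
    ["F2150", "locn", "block", "conv", "rep"],
    ["locn", "block", "F1100", "conv", "rep"],
    ["locn", "F3200", "block", "conv", "rep"],
    ["locn", "F5300", "block", "conv", "rep"],
    ["F4250", "locn", "block", "conv", "rep"],
    ["conv", "F5300", "locn"],
    ["conv", "F3200", "locn"],
    ["conv", "F4250", "locn"],
    ["conv", "F1100", "No_bigtubers", "locn"],
    ["conv", "F2150", "No_bigtubers", "locn"],
]

# Count co-occurrences; sorting each transaction keeps (a, b) and (b, a) as one edge.
edge_weights = Counter()
for t in shown:
    for a, b in combinations(sorted(set(t)), 2):
        edge_weights[(a, b)] += 1

# Heaviest edges first.
for (a, b), weight in edge_weights.most_common(5):
    print(a, b, weight)
```

The resulting `(item, item) -> count` mapping could be passed to a graph library such as networkx for drawing; that library is not used elsewhere in this notebook.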