In [2]:
#install.packages("infotheo")
library(infotheo)

**Mutual Information (MI)** is an information-theoretic measure of the dependence between two random variables \(X\) and \(Y\). It quantifies **how much knowing one of these variables reduces uncertainty about the other**. Formally, for discrete random variables, it is defined as:

$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$

where:
- \( p(x, y) \) is the joint probability distribution of \( X \) and \( Y \).
- \( p(x) \) and \( p(y) \) are the marginal probability distributions of \( X \) and \( Y \).
- The log base determines the unit of the mutual information (e.g., bits if base 2, nats if base \( e \)).
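
As a worked instance of the definition above, the sketch below evaluates the double sum for a small, hypothetical 2x2 joint table (the probabilities are made up for illustration) using base R only, and reports the result in both nats and bits.

```r
# Hypothetical 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy <- matrix(c(0.4, 0.1,
                 0.1, 0.4),
               nrow = 2, byrow = TRUE)

p_x <- rowSums(p_xy)  # marginal p(x)
p_y <- colSums(p_xy)  # marginal p(y)

# Product of marginals p(x) p(y), arranged like the joint table.
p_indep <- outer(p_x, p_y)

# Sum p(x, y) * log(p(x, y) / (p(x) p(y))) over cells with nonzero mass.
nz <- p_xy > 0
mi_nats <- sum(p_xy[nz] * log(p_xy[nz] / p_indep[nz]))
mi_bits <- mi_nats / log(2)  # same quantity with a base-2 logarithm

print(c(nats = mi_nats, bits = mi_bits))
```

For this table the sum comes to about 0.193 nats, or about 0.278 bits.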

**Key Properties**:
- **Non-negativity**: \( I(X;Y) \ge 0 \).
- **Symmetry**: \( I(X;Y) = I(Y;X) \).
- **Independence**: \( I(X;Y) = 0 \) if and only if \( X \) and \( Y \) are independent.
- MI can capture **nonlinear** relationships between variables, unlike measures of linear correlation.

By contrast:
1. **Pearson’s correlation** measures only linear relationships.
2. **Spearman’s rho** (rank correlation) captures monotonic relationships (they can be nonlinear, but must be consistently increasing or decreasing).

Hence, mutual information can detect more general statistical dependencies than Pearson or Spearman correlation.
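
The same quantity can be written in terms of entropies, with \( H(X) = -\sum_{x} p(x) \log p(x) \). This gives an upper bound that helps when reading MI estimates from binned data: with natural logarithms and 10 bins per variable, for example, the value cannot exceed \( \ln 10 \approx 2.303 \) nats.

```latex
\begin{aligned}
I(X;Y) &= H(X) + H(Y) - H(X,Y) = H(X) - H(X \mid Y), \\
0 \le I(X;Y) &\le \min\{H(X),\, H(Y)\} \le \log \min\{|\mathcal{X}|,\, |\mathcal{Y}|\}.
\end{aligned}
```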


In [3]:
N = 1000

In [4]:
X_lin = rnorm(N)
Y_lin = 3 * X_lin + rnorm(N, sd = 0.5)


In [6]:
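# discretize() defaults to equal-frequency bins with nbins = NROW(X)^(1/3); mutinformation() returns nats
mi_lin = mutinformation(discretize(X_lin), discretize(Y_lin))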
mi_lin

In [8]:
pearson_lin = cor(X_lin, Y_lin, method = "pearson")
spearman_lin = cor(X_lin, Y_lin, method = "spearman")
print(pearson_lin)
print(spearman_lin)

[1] 0.9836894
[1] 0.9818015


In [5]:
X_nonlin = rnorm(N)
Y_nonlin = X_nonlin^2 + rnorm(N, sd = 0.5)

In [7]:
mi_nonlin = mutinformation(discretize(X_nonlin), discretize(Y_nonlin))
mi_nonlin

In [9]:
pearson_nonlin = cor(X_nonlin, Y_nonlin, method = "pearson")
spearman_nonlin = cor(X_nonlin, Y_nonlin, method = "spearman")
print(pearson_nonlin)
print(spearman_nonlin)

[1] -0.001887586
[1] -0.008231636


In [10]:
N = 1000

In [11]:
X1 = rnorm(N)

In [14]:
X2 = 2 * X1 + rnorm(N, sd = 0.3)#linear

In [13]:
X3 = X1^2 + rnorm(N, sd = 0.2) # quadratic dependency

In [15]:
X4 = sin(X1) + rnorm(N, sd = 0.2)#sinusoidal dependency

In [16]:
X5 = runif(N, min = -1, max = 1)#independent uniform

In [17]:
X6 = rnorm(N)#independent variable

In [18]:
df = data.frame(X1, X2, X3, X4, X5, X6)

In [19]:
head(df)

,X1,X2,X3,X4,X5,X6
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-1.4144286,-2.78787437,2.0352345,-0.8150974,0.1855525,0.7007855
2,-0.8413664,-2.089523,1.1533392,-0.8389696,0.84941,0.2730103
3,-0.1619804,-0.06858428,0.2947187,-0.4475097,0.1061987,0.3989977
4,0.6695147,1.76137513,0.1994325,0.7013241,0.8699784,-0.562031
5,-1.6869419,-3.22427263,2.8740125,-0.8626623,-0.7482232,0.7553628
6,1.3598116,2.42936264,2.0190178,0.5844627,-0.1279192,-1.0108293


In [34]:
df_disc = data.frame(matrix(nrow = nrow(df), ncol = ncol(df)))
colnames(df_disc) = colnames(df)
head(df_disc,3)

,X1,X2,X3,X4,X5,X6
,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
1,,,,,,
2,,,,,,
3,,,,,,


In [35]:
colnames(df_disc)

In [38]:
for (col_name in colnames(df)) {
  df_disc[,col_name] = discretize(df[,col_name], disc = "equalfreq", nbins = 10)
}

In [39]:
str(df_disc)

'data.frame':	1000 obs. of  6 variables:
 $ X1: int  1 2 5 8 1 10 10 3 4 5 ...
 $ X2: int  1 2 6 9 1 10 10 2 4 5 ...
 $ X3: int  9 8 5 4 10 9 10 7 3 1 ...
 $ X4: int  2 2 3 9 2 8 6 3 4 5 ...
 $ X5: int  6 10 6 10 2 5 1 5 7 3 ...
 $ X6: int  8 7 7 3 8 2 1 9 4 2 ...
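
To make the binning step concrete, here is a minimal base-R sketch of equal-frequency discretization. `equal_freq_bins` is a hypothetical helper written for illustration; it is not `infotheo`'s implementation and may place boundary or tied values differently from `discretize()`.

```r
# Hypothetical helper: rank-based equal-frequency binning.
equal_freq_bins <- function(x, nbins) {
  r <- rank(x, ties.method = "first")         # ranks 1..n, ties broken by position
  as.integer(ceiling(r * nbins / length(x)))  # map ranks to bin labels 1..nbins
}

set.seed(1)
x <- rnorm(1000)
bins <- equal_freq_bins(x, nbins = 10)
table(bins)  # each of the 10 bins holds 100 observations
```

Because bins are defined by rank rather than by value, each bin holds the same number of points regardless of the shape of the distribution, which is why the integer codes above run from 1 to 10.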


In [41]:
df_disc[11:14,1:3]

,X1,X2,X3
,<int>,<int>,<int>
11,4,5,3
12,6,5,1
13,10,10,10
14,10,10,10


In [43]:
#all pairs of column indices
var_names = colnames(df_disc)
pairs_idx = combn(seq_along(var_names), 2, simplify = FALSE)
pairs_idx

In [44]:
var_names = colnames(df_disc)
num_vars = length(var_names)

In [47]:
mi_results_df = data.frame(
  Var1 = character(),
  Var2 = character(),
  MutualInformation   = numeric(),
  stringsAsFactors = FALSE
)

In [48]:
for (i in 1:(num_vars - 1)) {
  for (j in (i + 1):num_vars) {
    var1 = var_names[i]
    var2 = var_names[j]
    
    mi_value = mutinformation(df_disc[[var1]], df_disc[[var2]])
    
    mi_results_df = rbind(
      mi_results_df,
      data.frame(Var1 = var1, Var2 = var2, MutualInformation = mi_value, stringsAsFactors = FALSE)
    )
  }
}
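
The nested loop above grows the result with `rbind` on every iteration. A possible alternative, sketched here on a small toy data frame (`toy`, `toy_disc`, and `mi_pairs` are illustrative names, and the data are made up), iterates over the list of index pairs produced by `combn` and binds the rows once at the end.

```r
library(infotheo)

set.seed(42)
x1 <- rnorm(500)
toy <- data.frame(X1 = x1,
                  X2 = 2 * x1 + rnorm(500, sd = 0.3),  # depends on X1
                  X3 = rnorm(500))                      # independent of X1 and X2
toy_disc <- discretize(toy, disc = "equalfreq", nbins = 10)

# One list element per unordered pair of column indices.
pairs_idx <- combn(seq_along(toy_disc), 2, simplify = FALSE)

# Build one small data frame per pair, then bind them in a single step.
rows <- lapply(pairs_idx, function(p) {
  data.frame(Var1 = names(toy_disc)[p[1]],
             Var2 = names(toy_disc)[p[2]],
             MutualInformation = mutinformation(toy_disc[[p[1]]], toy_disc[[p[2]]]),
             stringsAsFactors = FALSE)
})
mi_pairs <- do.call(rbind, rows)
mi_pairs[order(-mi_pairs$MutualInformation), ]
```

For a handful of columns the difference is cosmetic; the single `do.call(rbind, ...)` mainly avoids repeated copying when the number of pairs is large.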

In [49]:
mi_results_df

Var1,Var2,MutualInformation
<chr>,<chr>,<dbl>
X1,X2,1.46824766
X1,X3,0.8517312
X1,X4,0.99306379
X1,X5,0.0350608
X1,X6,0.0463948
X2,X3,0.73360066
X2,X4,0.91051138
X2,X5,0.03164519
X2,X6,0.04758403
X3,X4,0.38502225


In [51]:
mi_results_df = mi_results_df[order(-mi_results_df$MutualInformation), ]

In [52]:
mi_results_df

,Var1,Var2,MutualInformation
,<chr>,<chr>,<dbl>
1,X1,X2,1.46824766
3,X1,X4,0.99306379
7,X2,X4,0.91051138
2,X1,X3,0.8517312
6,X2,X3,0.73360066
10,X3,X4,0.38502225
14,X4,X6,0.0501695
12,X3,X6,0.04879728
9,X2,X6,0.04758403
5,X1,X6,0.0463948


# questions

Make a data frame `df` with 5 columns: `A`, `B`, `C`, `D`, `E`.  
1. Discretize each column into 5 equal-frequency bins.  
2. Compute the **pairwise mutual information** between all columns.  
3. Sort the results in **descending** order of MI.  
4. Show the top 3 pairs (highest MI).

**Answer**  
Below is an example final output (the exact numeric MI values may differ if you use random data, but the format is illustrative):

| Var1 | Var2 |   MI   |
|------|------|--------|
| B    | D    | 0.71   |
| B    | E    | 0.59   |
| A    | D    | 0.52   |
| ...  | ...  | ...    |

The top 3 pairs with the highest mutual information are `(B,D)`, `(B,E)`, and `(A,D)`.
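
One way to produce such a table is sketched below. The exercise does not say how the columns are generated, so the data-generating choices here (which columns depend on which, the noise levels, the seed, `n = 500`) are assumptions for illustration, and the resulting numbers will not match the example table.

```r
library(infotheo)

set.seed(123)
n <- 500
# Assumed structure: D and E depend on B; A and C are unrelated noise.
B <- rnorm(n)
df <- data.frame(A = rnorm(n),
                 B = B,
                 C = runif(n),
                 D = B + rnorm(n, sd = 0.3),
                 E = B^2 + rnorm(n, sd = 0.5))

# 1. Five equal-frequency bins per column.
df_disc <- discretize(df, disc = "equalfreq", nbins = 5)

# 2. Called on a whole data frame, mutinformation() returns the
#    matrix of pairwise MI values (in nats).
mi_mat <- mutinformation(df_disc)

# Keep each unordered pair once (upper triangle), in long format.
idx <- which(upper.tri(mi_mat), arr.ind = TRUE)
mi_long <- data.frame(Var1 = colnames(df_disc)[idx[, "row"]],
                      Var2 = colnames(df_disc)[idx[, "col"]],
                      MI   = mi_mat[idx],
                      stringsAsFactors = FALSE)

# 3. Sort in descending order of MI; 4. show the top three pairs.
mi_long <- mi_long[order(-mi_long$MI), ]
head(mi_long, 3)
```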



Consider two numeric vectors `X` and `Y`, each of length 1,000. Suppose `Y` is defined as:
\[
Y = 2 \cdot X + \text{some\_noise}
\]
1. Show the first **10** rows of `X` and `Y` in a small data frame.  
2. Compute the **Pearson correlation** between `X` and `Y`.  
3. Compute the **Spearman rank correlation** between `X` and `Y`.  
4. Compute the **Mutual Information** (discretized into 8 bins).

**Answer**  
An example of the final numeric results might look like:

- **First 10 rows** (just an example):

  | idx |    X     |     Y     |
  |-----|----------|-----------|
  | 1   | -0.427   | -0.989    |
  | 2   |  1.025   |  2.215    |
  | ... |  ...     |   ...     |

- **Pearson correlation**: ~ 0.94  
- **Spearman correlation**: ~ 0.93  
- **Mutual Information**: ~ 0.85
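
A minimal sketch of these four steps follows. The distribution of `X`, the noise standard deviation, and the seed are assumptions, since the exercise leaves them open; different choices will give different numbers from the example above.

```r
library(infotheo)

set.seed(7)
n <- 1000
X <- rnorm(n)
Y <- 2 * X + rnorm(n, sd = 0.7)  # noise level is an assumption

# 1. First 10 rows in a small data frame.
head(data.frame(X = X, Y = Y), 10)

# 2-3. Pearson and Spearman correlations.
pearson  <- cor(X, Y, method = "pearson")
spearman <- cor(X, Y, method = "spearman")

# 4. Mutual information (nats) after binning each vector
#    into 8 equal-frequency bins.
mi <- mutinformation(discretize(X, disc = "equalfreq", nbins = 8),
                     discretize(Y, disc = "equalfreq", nbins = 8))

print(c(pearson = pearson, spearman = spearman, mi = mi))
```

Note that the MI value is not on the same scale as the correlations: with 8 bins per variable it is bounded above by `log(8)` nats rather than by 1.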

Consider a 6-column data frame `df` with columns `X1, X2, X3, X4, X5, X6`.
1. Assume `X1` and `X2` are strongly linearly related, `X3` and `X4` are **nonlinearly** related (e.g., \(X4 = \sin(X3)\)), and `X5, X6` are **independent** of the other columns.  
2. Discretize all columns into 6 equal-frequency bins.  
3. Calculate **pairwise mutual information** and store results in a new data frame.  
4. **Which pair** has the highest MI? Which pair has the **lowest** MI?

**Answer**  
An example of the final pairwise MI results (sorted):

| Var1 | Var2 |   MI   | Note    |
|------|------|--------|---------|
| X1   | X2   | 0.90   | Highest |
| X3   | X4   | 0.73   |         |
| ...  | ...  | ...    |         |
| X5   | X6   | 0.02   | Lowest  |

- **Highest MI** is `(X1, X2)` at 0.90.  
- **Lowest MI** is `(X5, X6)` at 0.02.  

These results suggest `X1` and `X2` are strongly dependent, `X3` and `X4` are also substantially dependent (nonlinearly), and `X5` and `X6` have minimal dependence.
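
The setup can be simulated as in the sketch below. The specific distributions, noise levels, and seed are assumptions. Which of the two dependent pairs ranks first depends on those choices and on the number of bins, and the lowest pair can be any of the pairs with no real dependence, so the ranking in the example table is not guaranteed.

```r
library(infotheo)

set.seed(2024)
n <- 1000
X1 <- rnorm(n)
X3 <- runif(n, min = -pi, max = pi)
df <- data.frame(X1 = X1,
                 X2 = 2 * X1 + rnorm(n, sd = 0.3),   # strongly linear in X1
                 X3 = X3,
                 X4 = sin(X3) + rnorm(n, sd = 0.1),  # nonlinear in X3
                 X5 = runif(n),                       # independent
                 X6 = rnorm(n))                       # independent

# Six equal-frequency bins per column.
df_disc <- discretize(df, disc = "equalfreq", nbins = 6)

# All unordered column pairs, one pair per column of `pairs`.
pairs <- combn(colnames(df_disc), 2)
mi_results_df <- data.frame(
  Var1 = pairs[1, ],
  Var2 = pairs[2, ],
  MI   = apply(pairs, 2, function(p) mutinformation(df_disc[[p[1]]], df_disc[[p[2]]])),
  stringsAsFactors = FALSE
)

mi_results_df[which.max(mi_results_df$MI), ]  # most dependent pair
mi_results_df[which.min(mi_results_df$MI), ]  # least dependent pair
```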


In [64]:
library(datasets)

In [69]:
data(iris)

In [70]:
df = iris

In [73]:
df_disc = data.frame(
  matrix(nrow = nrow(df), ncol = 4)
)

In [74]:
colnames(df_disc) = colnames(df)[1:4]

for (col_name in colnames(df)[1:4]) {
  df_disc[,col_name] = discretize(df[,col_name], disc = "equalfreq", nbins = 5)
}

In [75]:
df_disc$Species = df$Species

In [76]:
str(df_disc)
head(df_disc)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: int  2 1 1 1 1 2 1 1 1 1 ...
 $ Sepal.Width : int  5 2 4 3 5 5 4 4 2 3 ...
 $ Petal.Length: int  1 1 1 1 1 2 1 1 1 1 ...
 $ Petal.Width : int  1 1 1 1 1 2 2 1 1 1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
,<int>,<int>,<int>,<int>,<fct>
1,2,5,1,1,setosa
2,1,2,1,1,setosa
3,1,4,1,1,setosa
4,1,3,1,1,setosa
5,1,5,1,1,setosa
6,2,5,2,2,setosa


In [77]:
var_names = colnames(df_disc)
num_vars = length(var_names)

In [78]:
mi_results_df = data.frame(
  Var1 = character(),
  Var2 = character(),
  MI   = numeric(),
  stringsAsFactors = FALSE
)

In [79]:
for (i in 1:(num_vars - 1)) {
  for (j in (i + 1):num_vars) {
    v1 = var_names[i]
    v2 = var_names[j]
    
    mi_value = mutinformation(df_disc[[v1]], df_disc[[v2]])
    
    mi_results_df = rbind(
      mi_results_df,
      data.frame(Var1 = v1, Var2 = v2, MI = mi_value, stringsAsFactors = FALSE)
    )
  }
}

In [80]:
mi_results_df = mi_results_df[order(-mi_results_df$MI), ]
mi_results_df

,Var1,Var2,MI
,<chr>,<chr>,<dbl>
8,Petal.Length,Petal.Width,0.8455931
10,Petal.Width,Species,0.8328724
9,Petal.Length,Species,0.8278344
2,Sepal.Length,Petal.Length,0.585144
3,Sepal.Length,Petal.Width,0.5367162
4,Sepal.Length,Species,0.4292148
6,Sepal.Width,Petal.Width,0.3119989
7,Sepal.Width,Species,0.2499608
5,Sepal.Width,Petal.Length,0.2483765
1,Sepal.Length,Sepal.Width,0.1947409


In [82]:
#miami housing data
df_raw = read.csv("miami-housing.csv", header = TRUE)

In [83]:
keep_cols = c(
  "SALE_PRC",      # target variable
  "LATITUDE",
  "LONGITUDE",
  "LND_SQFOOT",
  "TOT_LVG_AREA",
  "SPEC_FEAT_VAL",
  "RAIL_DIST",
  "OCEAN_DIST",
  "WATER_DIST",
  "CNTR_DIST",
  "SUBCNTR_DI",
  "HWY_DIST",
  "age",
  "avno60plus",
  "month_sold",
  "structure_quality"
)

In [84]:
df = df_raw[, keep_cols]

In [85]:
df_disc = data.frame(
  matrix(nrow = nrow(df), ncol = ncol(df))
)

colnames(df_disc) = colnames(df)

In [86]:
for (col_name in colnames(df)) {

  if (is.numeric(df[[col_name]])) {
    df_disc[[col_name]] = discretize(df[[col_name]], disc = "equalfreq", nbins = 8)
  } else {

    df_disc[[col_name]] = df[[col_name]]
  }
}

In [87]:
target_var = "SALE_PRC"
all_vars = colnames(df_disc)

In [88]:
mi_results = data.frame(Variable = character(),
                        MI       = numeric(),
                        stringsAsFactors = FALSE)


In [89]:
for (var_name in all_vars) {
  if (var_name != target_var) {
    mi_value = mutinformation(df_disc[[target_var]], df_disc[[var_name]])
    
    # Store result
    mi_results = rbind(
      mi_results,
      data.frame(Variable = var_name, MI = mi_value, stringsAsFactors = FALSE)
    )
  }
}

In [90]:
mi_results = mi_results[order(-mi_results$MI), ]

In [91]:
mi_results

,Variable,MI
,<chr>,<dbl>
4,TOT_LVG_AREA,0.328235566
1,LATITUDE,0.176728502
7,OCEAN_DIST,0.170094265
2,LONGITUDE,0.167782448
15,structure_quality,0.149154776
9,CNTR_DIST,0.132357729
10,SUBCNTR_DI,0.131949268
5,SPEC_FEAT_VAL,0.12520301
3,LND_SQFOOT,0.123727033
8,WATER_DIST,0.092534684


## stock data


In [92]:
data("EuStockMarkets")

In [93]:
str(EuStockMarkets)

 Time-Series [1:1860, 1:4] from 1991 to 1999: 1629 1614 1607 1621 1618 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "DAX" "SMI" "CAC" "FTSE"


In [94]:
stock_df = as.data.frame(EuStockMarkets)

In [95]:
colnames(stock_df) = c("DAX", "SMI", "CAC", "FTSE")

In [96]:
stock_df_returns = data.frame(
  DAX  = diff(log(stock_df$DAX)),
  SMI  = diff(log(stock_df$SMI)),
  CAC  = diff(log(stock_df$CAC)),
  FTSE = diff(log(stock_df$FTSE))
)

In [97]:
df_disc = data.frame(
  matrix(nrow = nrow(stock_df), ncol = ncol(stock_df))
)
colnames(df_disc) = colnames(stock_df)

In [99]:
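# Note: this discretizes the raw price levels in stock_df;
# stock_df_returns from the previous cell is not used here or below.
for (col_name in colnames(stock_df)) {
  df_disc[,col_name] = discretize(stock_df[,col_name], 
                                   disc = "equalfreq", 
                                   nbins = 10)
}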

In [101]:
head(df_disc)

,DAX,SMI,CAC,FTSE
,<int>,<int>,<int>,<int>
1,2,1,1,1
2,2,1,1,1
3,2,1,1,1
4,2,1,1,1
5,2,1,1,1
6,2,1,1,1


In [102]:
var_names = colnames(df_disc)
num_vars = length(var_names)

In [103]:
mi_results = data.frame(
  Var1 = character(),
  Var2 = character(),
  MI   = numeric(),
  stringsAsFactors = FALSE
)

In [104]:
for (i in 1:(num_vars - 1)) {
  for (j in (i + 1):num_vars) {
    v1 = var_names[i]
    v2 = var_names[j]
    
    mi_value = mutinformation(df_disc[[v1]], df_disc[[v2]])
    
    mi_results = rbind(
      mi_results,
      data.frame(Var1 = v1, Var2 = v2, MI = mi_value, stringsAsFactors = FALSE)
    )
  }
}


In [105]:
mi_results = mi_results[order(-mi_results$MI), ]
mi_results

,Var1,Var2,MI
,<chr>,<chr>,<dbl>
5,SMI,FTSE,1.6044848
1,DAX,SMI,1.4875321
3,DAX,FTSE,1.3827812
2,DAX,CAC,0.9279278
6,CAC,FTSE,0.8663173
4,SMI,CAC,0.8238239
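
The values above are computed on price levels (`stock_df`), not on the log returns stored in `stock_df_returns`. Because all four indices trend over the sample, binned price levels largely encode the time period, which likely inflates the pairwise MI. A sketch of the same analysis on daily log returns follows; the names `prices`, `returns`, `returns_disc`, and `mi_returns` are illustrative, and the 10-bin choice simply mirrors the cells above.

```r
library(infotheo)

data("EuStockMarkets")
prices <- as.data.frame(EuStockMarkets)

# Daily log returns: difference of log prices, one row fewer than the price table.
returns <- as.data.frame(lapply(prices, function(p) diff(log(p))))

# Ten equal-frequency bins per index.
returns_disc <- discretize(returns, disc = "equalfreq", nbins = 10)

# Pairwise MI (nats) between the discretized return series, as a 4x4 matrix.
mi_returns <- mutinformation(returns_disc)
round(mi_returns, 3)
```

The off-diagonal entries are the pairwise MI values; comparing them with the table above shows how much of the dependence measured on price levels comes from the shared trend.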
