# Introduction to Data Science 2025

# Week 6: Recap

## Exercise 1 | Linear regression with feature selection

Download the [TED Talks](https://www.kaggle.com/rounakbanik/ted-talks) dataset from Kaggle. Your task is to predict both the ratings and the number of views of a given TED talk. You should focus only on the <span style="font-weight: bold">ted_main</span> table.

1. Download the data and extract the following ratings from the <span style="font-weight: bold">ratings</span> column: <span style="font-weight: bold">Funny</span>, <span style="font-weight: bold">Confusing</span>, <span style="font-weight: bold">Inspiring</span>. Store these values in their own columns so that they are easier to access. Next, extract the tags from the <span style="font-weight: bold">tags</span> column. Count the number of occurrences of each tag and select the 100 most common tags. Create a binary variable for each of these and include them in your data table, so that you can directly see whether a given tag (among the top 100) is used in a given TED talk. The dataset you compose should have dimensions (2550, 104) and consist of the 'views' column, the three columns with counts of "Funny", "Confusing" and "Inspiring" ratings, and 100 columns that one-hot encode the 100 most common tags.


In [None]:
import pandas as pd
import numpy as np
import ast
from collections import Counter

df = pd.read_csv('ted_main.csv')

def extract_rating(ratings_str, rating_name):
    ratings_list = ast.literal_eval(ratings_str)
    for rating in ratings_list:
        if rating['name'] == rating_name:
            return rating['count']
    return 0

df['Funny'] = df['ratings'].apply(lambda x: extract_rating(x, 'Funny'))
df['Confusing'] = df['ratings'].apply(lambda x: extract_rating(x, 'Confusing'))
df['Inspiring'] = df['ratings'].apply(lambda x: extract_rating(x, 'Inspiring'))

# Parse each tags string once; literal_eval is slow, so avoid repeating it per tag.
tags_parsed = df['tags'].apply(ast.literal_eval)

# Count how often each tag occurs across all talks.
tag_counts = Counter(tag for tags in tags_parsed for tag in tags)
top_100_tags = [tag for tag, count in tag_counts.most_common(100)]

# One binary indicator column per top-100 tag.
for tag in top_100_tags:
    df[f'tag_{tag}'] = tags_parsed.apply(lambda tags: int(tag in tags))

data = df[['views', 'Funny', 'Confusing', 'Inspiring'] + [f'tag_{tag}' for tag in top_100_tags]]

print(f"Dataset shape: {data.shape}")
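
To make the two parsing steps concrete, here is a small self-contained sketch on made-up data. The two rows below are invented for illustration and only mimic the string format of the `ratings` and `tags` columns; the helper name `rating_counts` is hypothetical.

```python
import ast
from collections import Counter

# Invented rows that mimic the string format of ted_main.csv (not real data).
rows = [
    {"ratings": "[{'id': 7, 'name': 'Funny', 'count': 12}, "
                "{'id': 10, 'name': 'Inspiring', 'count': 30}]",
     "tags": "['science', 'technology']"},
    {"ratings": "[{'id': 2, 'name': 'Confusing', 'count': 4}]",
     "tags": "['science', 'art']"},
]

def rating_counts(ratings_str):
    """Turn the stringified list of dicts into a {name: count} mapping."""
    return {r["name"]: r["count"] for r in ast.literal_eval(ratings_str)}

# A rating that is missing for a talk defaults to 0.
funny = [rating_counts(r["ratings"]).get("Funny", 0) for r in rows]

# Parse the tags once, count them, and keep the most common ones.
tags_parsed = [ast.literal_eval(r["tags"]) for r in rows]
tag_counts = Counter(tag for tags in tags_parsed for tag in tags)
top_tags = [tag for tag, _ in tag_counts.most_common(2)]

# One-hot encode: one 0/1 indicator per top tag and row.
one_hot = [[int(tag in tags) for tag in top_tags] for tags in tags_parsed]

print(funny)
print(top_tags)
print(one_hot)
```

The same pattern scales to the full table: parse once, count, then build one indicator column per selected tag.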

2. Construct a linear regression model to predict the number of views based on the data in the <span style="font-weight: bold">ted_main</span> table, including the binary variables for the top-100 tags that you just created.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X = data.drop('views', axis=1)
y = data['views']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_views = LinearRegression()
model_views.fit(X_train, y_train)

y_pred = model_views.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"Views model R²: {r2:.4f}")
print(f"Views model MSE: {mse:.2f}")
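
For reference, the two metrics printed above can be reproduced by hand. The sketch below fits ordinary least squares with NumPy on synthetic data (the data and variable names are invented for illustration) and computes MSE and R² from their definitions. Unlike the cell above, it evaluates on the data it was fitted to, so it only illustrates the formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on two features plus noise.
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
y_hat = A @ coef

# MSE is the mean squared residual.
mse_demo = np.mean((y_demo - y_hat) ** 2)

# R^2 compares the residual sum of squares to that of a constant (mean) predictor.
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"coefficients: {coef.round(2)}")
print(f"MSE: {mse_demo:.3f}, R^2: {r2_demo:.3f}")
```

An R² near 0 means the model does little better than always predicting the mean, which is a useful baseline to keep in mind when reading the scores for views.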

3. Do the same for the <span style="font-weight: bold">Funny</span>, <span style="font-weight: bold">Confusing</span>, and <span style="font-weight: bold">Inspiring</span> ratings.

In [None]:
X_tags = data[[col for col in data.columns if col.startswith('tag_')]]

models_ratings = {}
for rating in ['Funny', 'Confusing', 'Inspiring']:
    y_rating = data[rating]
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_tags, y_rating, test_size=0.2, random_state=42)
    
    model = LinearRegression()
    model.fit(X_train_r, y_train_r)
    models_ratings[rating] = model
    
    y_pred_r = model.predict(X_test_r)
    r2_r = r2_score(y_test_r, y_pred_r)
    mse_r = mean_squared_error(y_test_r, y_pred_r)
    
    print(f"{rating} model R²: {r2_r:.4f}, MSE: {mse_r:.2f}")

4. You will probably notice that most of the tags are not useful in predicting the views and the ratings. You should use some kind of variable selection to prune the set of tags that are included in the model. You can use for example classical p-values or more modern [LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics)) techniques. Which tags are the best predictors of each of the response variables?

In [None]:
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_tags_scaled = scaler.fit_transform(X_tags)

targets = {
    'views': data['views'],
    'Funny': data['Funny'],
    'Confusing': data['Confusing'],
    'Inspiring': data['Inspiring']
}

important_tags = {}
for target_name, target_values in targets.items():
    lasso = LassoCV(cv=5, random_state=42, max_iter=10000)
    lasso.fit(X_tags_scaled, target_values)
    
    coefs = pd.DataFrame({
        'tag': [col.replace('tag_', '') for col in X_tags.columns],
        'coefficient': lasso.coef_
    })
    coefs['abs_coef'] = np.abs(coefs['coefficient'])
    important = coefs[coefs['abs_coef'] > 0].sort_values('abs_coef', ascending=False)
    important_tags[target_name] = important
    
    print(f"\n{target_name} - Top 10 important tags:")
    print(important.head(10)[['tag', 'coefficient']].to_string(index=False))
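
Why LASSO prunes variables can be seen in a special case. When the feature columns are orthonormal, the minimizer of `0.5 * ||y - X b||^2 + lam * ||b||_1` is the least-squares coefficient vector passed through a soft-thresholding function, so small coefficients become exactly zero. The sketch below uses synthetic data and an arbitrarily chosen `lam`; it illustrates the mechanism only and is not what `LassoCV` does internally for general feature matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a design matrix with orthonormal columns via QR (synthetic data).
n, p = 200, 6
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))

# Only the first two features truly matter.
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y_demo = Q @ beta_true + rng.normal(scale=0.1, size=n)

# With orthonormal columns, ordinary least squares reduces to Q^T y.
beta_ols = Q.T @ y_demo

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; entries with |z| <= lam become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# LASSO solution for the orthonormal case (lam is an arbitrary illustrative value).
lam = 0.5
beta_lasso = soft_threshold(beta_ols, lam)

print("OLS:  ", beta_ols.round(3))
print("LASSO:", beta_lasso.round(3))
```

Least squares assigns a small nonzero coefficient to every irrelevant feature, whereas the thresholded solution drops them entirely, at the cost of shrinking the relevant coefficients as well.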

5. Produce summaries of your results. Could you recommend good tags – or tags to avoid! – for speakers targeting plenty of views and/or certain ratings?

In [None]:
print("=" * 60)
print("TED Talks tag recommendations")
print("=" * 60)

for target_name in ['views', 'Funny', 'Confusing', 'Inspiring']:
    tags_df = important_tags[target_name]
    positive_tags = tags_df[tags_df['coefficient'] > 0].head(5)
    negative_tags = tags_df[tags_df['coefficient'] < 0].head(5)
    
    print(f"\n[{target_name}]")
    if len(positive_tags) > 0:
        print(f"  Recommended tags: {', '.join(positive_tags['tag'].values)}")
    if len(negative_tags) > 0:
        print(f"  Tags to avoid: {', '.join(negative_tags['tag'].values)}")

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 2 | Symbol classification (part 2)

Note that it is strongly recommended to use Python in this exercise. However, if you can find a suitable AutoML implementation for your favorite language (e.g., [here](http://h2o-release.s3.amazonaws.com/h2o/master/3888/docs-website/h2o-docs/automl.html) seems to be one for R), then you are free to use that language as well.

Use the preprocessed data from week 3 (you can also produce them using the example solutions of week 3).

1. This time, train a *random forest classifier* on the data. A random forest is a collection of *decision trees*, which makes it an *ensemble* of classifiers. Each tree uses a random subset of the features to make its prediction. Without tuning any parameters, what accuracy do you get?

In [None]:
import os
from PIL import Image
from sklearn.ensemble import RandomForestClassifier

labels_df = pd.read_csv('HASYv2/hasy-data-labels.csv')
filtered_labels = labels_df[(labels_df['symbol_id'] >= 70) & (labels_df['symbol_id'] <= 79)]

images = []
labels = []
for idx, row in filtered_labels.iterrows():
    img_path = os.path.join('HASYv2', row['path'])
    img = Image.open(img_path).convert('L')
    img_array = np.array(img).flatten()
    images.append(img_array)
    labels.append(row['symbol_id'])

X = np.array(images)
y = np.array(labels)

indices = np.arange(len(X))
np.random.seed(42)
np.random.shuffle(indices)
X_shuffled = X[indices]
y_shuffled = y[indices]

split_idx = int(0.8 * len(X_shuffled))
X_train = X_shuffled[:split_idx]
X_test = X_shuffled[split_idx:]
y_train = y_shuffled[:split_idx]
y_test = y_shuffled[split_idx:]

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

train_acc = rf_model.score(X_train, y_train)
test_acc = rf_model.score(X_test, y_test)

print(f"Random Forest - Training accuracy: {train_acc:.4f}")
print(f"Random Forest - Test accuracy: {test_acc:.4f}")

2. The number of trees in the random forest is an example of a hyperparameter, because it is set prior to the learning process. In contrast, a parameter is a value in the model that is learned from the data. Train 20 classifiers with varying numbers of decision trees, from 10 up to 200, and plot the test accuracy as a function of the number of trees. Does the accuracy keep increasing? Is more better?

In [None]:
import matplotlib.pyplot as plt

n_estimators_list = np.linspace(10, 200, 20, dtype=int)
test_accuracies = []

for n_est in n_estimators_list:
    rf = RandomForestClassifier(n_estimators=n_est, random_state=42)
    rf.fit(X_train, y_train)
    test_acc = rf.score(X_test, y_test)
    test_accuracies.append(test_acc)
    print(f"n_estimators={n_est}, test accuracy={test_acc:.4f}")

plt.figure(figsize=(10, 6))
plt.plot(n_estimators_list, test_accuracies, marker='o')
plt.xlabel('Number of Trees')
plt.ylabel('Test Accuracy')
plt.title('Random Forest: Test Accuracy vs Number of Trees')
plt.grid(True)
plt.show()
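
One way to build intuition for the shape of this curve is an idealized calculation. Assume, purely for illustration, a binary problem where every tree is correct with probability 0.6 and the trees err independently; the accuracy of the majority vote is then a binomial tail probability. Real trees in a forest are correlated and the task here has more than two classes, so the numbers do not transfer, but the diminishing returns are typical.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """P(more than half of n independent voters are correct); n_voters must be odd."""
    k_min = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

# Hypothetical per-tree accuracy of 0.6, evaluated for a few odd ensemble sizes.
for n in [1, 11, 51, 101, 201]:
    print(f"{n:3d} voters -> majority accuracy {majority_vote_accuracy(n, 0.6):.4f}")
```

Each additional batch of voters helps less than the previous one, while training cost keeps growing linearly with the number of trees.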

3. If we had picked the number of decision trees by taking the value with the best test accuracy from the last plot, we would have *overfit* our hyperparameters to the test data. Can you see why it is a mistake to tune the hyperparameters of your model using the test data?

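**Answer:**

Tuning hyperparameters on the test data is a mistake for the following reasons:

1. **Data leakage**: If we choose hyperparameters based on test-set performance, information from the test data influences the model, which violates the principle that the test set must stay completely independent of model building.

2. **Overfitting to the test set**: When we repeatedly evaluate different hyperparameter configurations on the test set and keep the best one, the selected configuration is partly fitted to the quirks of that particular test set. The test-set performance then becomes overly optimistic and no longer reflects how the model generalizes to new data.

3. **No way to assess true performance**: The purpose of the test set is to provide an unbiased estimate of the model's generalization ability. Once the test set has been used for tuning, that estimate is no longer unbiased, and we have lost our means of assessing the model's real performance.

**Correct approach**: Split the data into three parts: a training set, a validation set, and a test set. Train the model on the training set, tune the hyperparameters on the validation set, and finally evaluate the final model on the test set (used only once).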
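
The effect can be simulated without any real data. In the sketch below (all data synthetic, all names hypothetical), the labels are independent of the features, so every candidate classifier has a true accuracy of 50%. Picking the candidate with the best test accuracy nevertheless reports a clearly higher number, and the advantage disappears on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_test, n_fresh, n_candidates = 20, 200, 100_000, 200

def sample(n):
    """Features and labels that are independent: nothing can be learned here."""
    return rng.normal(size=(n, n_features)), rng.integers(0, 2, size=n)

def accuracy(w, X, y):
    """Accuracy of the linear rule 'predict 1 if X @ w > 0'."""
    return np.mean((X @ w > 0).astype(int) == y)

X_test_sim, y_test_sim = sample(n_test)
X_fresh, y_fresh = sample(n_fresh)

# Each "hyperparameter setting" is stood in for by a random weight vector.
candidates = rng.normal(size=(n_candidates, n_features))
test_accs = np.array([accuracy(w, X_test_sim, y_test_sim) for w in candidates])

# Select on the test set, then check the winner on data it has never seen.
best = int(np.argmax(test_accs))
best_test_acc = test_accs[best]
fresh_acc = accuracy(candidates[best], X_fresh, y_fresh)

print(f"Best test accuracy (used for selection): {best_test_acc:.3f}")
print(f"Same model on fresh data:                {fresh_acc:.3f}")
```

The gap between the two numbers is pure selection bias: the more configurations are compared on the same test set, the larger it tends to be.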

4. Reshuffle and resplit the data so that it is divided into 3 parts: training (80%), validation (10%) and test (10%). Repeatedly train a model of your choosing (e.g., a random forest) on the training data and evaluate its performance on the validation set, tuning the hyperparameters so that the accuracy on the validation set increases. Then, finally, evaluate the performance of your model on the test data. What can you say about the generalization of your model?

In [None]:
np.random.seed(42)
indices = np.arange(len(X))
np.random.shuffle(indices)
X_shuffled = X[indices]
y_shuffled = y[indices]

train_end = int(0.8 * len(X_shuffled))
val_end = int(0.9 * len(X_shuffled))

X_train_new = X_shuffled[:train_end]
y_train_new = y_shuffled[:train_end]
X_val = X_shuffled[train_end:val_end]
y_val = y_shuffled[train_end:val_end]
X_test_new = X_shuffled[val_end:]
y_test_new = y_shuffled[val_end:]

print(f"Training set: {X_train_new.shape}, Validation set: {X_val.shape}, Test set: {X_test_new.shape}")

best_score = 0
best_params = {}

for n_est in [50, 100, 150, 200]:
    for max_d in [10, 20, 30, None]:
        rf = RandomForestClassifier(n_estimators=n_est, max_depth=max_d, random_state=42)
        rf.fit(X_train_new, y_train_new)
        val_score = rf.score(X_val, y_val)
        
        if val_score > best_score:
            best_score = val_score
            best_params = {'n_estimators': n_est, 'max_depth': max_d}

print(f"\nBest parameters: {best_params}")
print(f"Best validation score: {best_score:.4f}")

final_model = RandomForestClassifier(**best_params, random_state=42)
final_model.fit(X_train_new, y_train_new)

train_acc_final = final_model.score(X_train_new, y_train_new)
val_acc_final = final_model.score(X_val, y_val)
test_acc_final = final_model.score(X_test_new, y_test_new)

print(f"\nFinal model performance:")
print(f"Training accuracy: {train_acc_final:.4f}")
print(f"Validation accuracy: {val_acc_final:.4f}")
print(f"Test accuracy: {test_acc_final:.4f}")

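print(f"\nGeneralization assessment:")
print(f"Accuracy gap between training and test sets: {abs(train_acc_final - test_acc_final):.4f}")
if abs(train_acc_final - test_acc_final) < 0.05:
    print("The model generalizes well; training and test performance are similar.")
elif train_acc_final > test_acc_final + 0.1:
    print("The model may be overfitting; it performs clearly better on the training set than on the test set.")
else:
    print("The model's generalization is acceptable.")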
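
The index arithmetic for the split can be wrapped in a small helper. The function name `three_way_split` and its arguments are hypothetical; the sketch assumes NumPy arrays and follows the same shuffle-then-slice approach as the cell above.

```python
import numpy as np

def three_way_split(X, y, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once, then cut into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    train_end = int(round(train_frac * len(X)))
    val_end = int(round((train_frac + val_frac) * len(X)))
    parts = idx[:train_end], idx[train_end:val_end], idx[val_end:]
    return [(X[p], y[p]) for p in parts]

# Toy data: 100 samples whose single feature equals the sample index.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)
(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = three_way_split(X_demo, y_demo)

print(len(X_tr), len(X_va), len(X_te))
```

Because all three parts come from one permutation, they are disjoint by construction and together cover every sample.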

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 3 | TPOT

The process of picking a suitable model, evaluating its performance and tuning the hyperparameters is very time-consuming. A new idea in machine learning is to automate this by using an optimization algorithm to find the best model in the space of models and their hyperparameters. Have a look at [TPOT](https://github.com/EpistasisLab/tpot), an automated ML solution that finds a good model and a good set of hyperparameters automatically. Try it on this data; it should easily outperform simple models like the ones we tried. Note that running the algorithm might take a while, depending on the speed of your computer.

*Note*: If it runs for too long, check whether the parameters you pass to TPOT are reasonable, e.g., try reducing the number of `generations` or the `population_size`. TPOT uses cross-validation internally, so we don't need our own validation set.

In [None]:
from tpot import TPOTClassifier

X_train_tpot = X_shuffled[:train_end]
y_train_tpot = y_shuffled[:train_end]
X_test_tpot = X_shuffled[train_end:]
y_test_tpot = y_shuffled[train_end:]

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    cv=5,
    random_state=42,
    verbosity=2,
    n_jobs=-1
)

tpot.fit(X_train_tpot, y_train_tpot)

tpot_score = tpot.score(X_test_tpot, y_test_tpot)
print(f"\nTPOT Test Accuracy: {tpot_score:.4f}")

tpot.export('tpot_pipeline.py')
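
Since TPOT scores candidate pipelines by cross-validation on the training data, it may help to see what `cv=5` means. The sketch below is a plain, unstratified NumPy illustration of k-fold index generation (the function name `k_fold_indices` is hypothetical, and this is not TPOT's internal code): each sample is used for validation exactly once.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; the k validation folds partition the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# 23 samples do not divide evenly by 5, so fold sizes differ by at most one.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(23, k=5)):
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} validation")
```

Averaging the k validation scores gives a less noisy estimate than a single validation split, which is why a separate validation set is not needed here.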

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**