In this problem set, we'll do both unsupervised and supervised learning on text.

Similar to PS1, you're free to execute the notebook in your personal environment, but I would strongly recommend using Google Colab. You can upload this notebook to Google colab by following the steps below.

1. Open [colab.research.google.com](colab.research.google.com)
2. Click on the upload tab
3. Upload the `.ipynb` file by choosing the right file from your local disk


**Submission instructions**

1. When you're ready to submit, you'll save the notebook as QTM340-PS2-Firstname-Lastname.ipynb; for example, if your name is Harry Potter, save the file as `QTM340-PS2-Harry-Potter.ipynb`. This can be done in Google colab by editing the filename and then following File --> Download --> .ipynb

2. Upload this file on canvas.

In this problem set, you'll learn to:

(a) Use `gensim` to find topics

(b) Use scikit-learn to train multiple classifiers

(c) Calculate CI and test hypothesis

We'll do all of this on a collection of movie plot summaries that are extracted from Wikipedia courtesy of this [paper from Bamman et. al.](http://www.cs.cmu.edu/~dbamman/pubs/pdf/bamman+oconnor+smith.acl13.pdf).

In [277]:
!wget http://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz
!tar -xzvf MovieSummaries.tar.gz

--2023-10-16 07:32:59--  http://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz
Resolving www.cs.cmu.edu (www.cs.cmu.edu)... 128.2.42.95
Connecting to www.cs.cmu.edu (www.cs.cmu.edu)|128.2.42.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48002242 (46M) [application/x-gzip]
Saving to: ‘MovieSummaries.tar.gz.1’


2023-10-16 07:33:13 (3.17 MB/s) - ‘MovieSummaries.tar.gz.1’ saved [48002242/48002242]

MovieSummaries/
MovieSummaries/tvtropes.clusters.txt
MovieSummaries/name.clusters.txt
MovieSummaries/plot_summaries.txt
MovieSummaries/README.txt
MovieSummaries/movie.metadata.tsv
MovieSummaries/character.metadata.tsv


You'll find a `MovieSummaries` directory which contains many files. We're interested in the `README.txt`, `plot_summaries.txt`, and `movie.metadata.txt`.

Let's print the `README.txt` file.

In [278]:
!cat MovieSummaries/README.txt

This README describes data in the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenues, genre and date of release) and character level (including gender and estimated age).  This data supports work in the following paper:

David Bamman, Brendan O'Connor and Noah Smith, "Learning Latent Personas of Film Characters," in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

###
#
# DATA
#
###

1. plot_summaries.txt.gz [29 M] 

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia.  Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.


2. corenlp_plot_summaries.tar.gz [628 M, sep

The dataset directory contains wikipedia plot summaries for 42306 movies. There is also metadata for 81741 movies. For this assignment, we're interested in the genre of a movie, its wiki id, its release date, and its plot summary

In [None]:
import pandas as pd
import gensim
from tqdm import tqdm
import numpy as np
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
tqdm.pandas()

Let's read the entire dataset as a pandas dataframe.

In [280]:
# Read the plot summaries into a dataframe
df = pd.read_csv ('MovieSummaries/plot_summaries.txt', sep='\t', names=['wiki_id', 'summary'])

# Map the wikiid to a number and vice versa
wikiid2rownum = {id: i for i, id in enumerate (df.wiki_id.values)}
rownum2wikiid = {i: id for i, id in enumerate (df.wiki_id.values)}

# Read the metadata about the movies into a dataframe
metadata = pd.read_csv ("MovieSummaries/movie.metadata.tsv", sep='\t', names=["wiki_id",
                                                                              "freebase_id",
                                                                              "name",
                                                                              "release_date",
                                                                              "revenue",
                                                                              "runtime",
                                                                              "languages",
                                                                              "countries",
                                                                              "genres"])

## 1. Topic modeling [2 points]

We'll first try to do some exploratory analysis based on topic modeling. Particularly, we're interested in finding

- the topics in this collection
- the correlation of the topics
- the prevalence of topics over time

To do so, first we'll use the `nltk` library to tokenize the plot summaries. We could have also used `spacy` but it takes too much time.

We'll lower case the summaries, remove all the stop words, and only consider the alphabetic characters. The following code takes 2-3 minutes to run on google colab

In [281]:
def tokenize_summaries (summary):
  stop = set(stopwords.words ('english'))
  tokens = nltk.word_tokenize(summary)
  tokens = [token.lower () for token in tokens if token.isalpha() and token not in stop]
  return tokens

df["tokens"] = df["summary"].progress_apply (lambda x: tokenize_summaries (x))

100%|██████████| 42303/42303 [02:23<00:00, 293.91it/s]


Let's create a dictionary for the corpus. We can do this by calling gensim's `Dictionary` object and passing the tokenized corpus. We'll also apply some light filters to trim the vocabulary to do topic modeling.

In [282]:
# Construct a dictionary of words from the corpus
dictionary = gensim.corpora.Dictionary(df['tokens'])
print (f"Before filtering: {len (dictionary)}")

# Filter the dictionary to meet frequency thresholds
dictionary.filter_extremes(no_below=10,
                           no_above=0.5,
                           keep_n=10000)
print (f"After filtering: {len (dictionary)}")

Before filtering: 134369
After filtering: 10000


By creating a dictionary in gensim, you map every token to an id, allowing you to convert a stream of tokens to a stream of ids. You can access these ids using `token2id` property that gets set. To run the LDA topic model, we'll have to transform the documents

In [283]:
# The map of wordids back the words
id2token = {id: token for token, id in dictionary.token2id.items()}

# Construction of the corpus in the format that gensim expects
corpus = [dictionary.doc2bow(doc) for doc in df['tokens']]

**Your turn!** Run LDA model for the above corpus to get 10 topics. [0.5 points]

You'll want to use `LdaMulticore` model and set the number of passes or sweeps over the data to 5 and number of iterations to 50.

**Note:** This may take a few minutes.

In [284]:
lda = None
# Your code below


In [None]:
# For the code below to run, your topic model should be in variable named lda

topics = lda.show_topics(num_topics=10, num_words=10, log=True, formatted=False)

# We'll print the top words associated with each topic
for topic_num, topic_words in topics:
  topic_rep = " ".join ([f"{w}({p:.4f})" for w,p in topic_words])
  print (f"Topic {topic_num}: {topic_rep}")

**Sanity check**: I found one topic which seemed to be about war and army, etc


**Your turn!** Now calculate the following distributions as numpy matrices[0.5 points]

(a) For every document, get the mixture of topics. We can do this by calling the `get_document_topics` of the lda model method on the entire corpus.

(b) For every topic, get the distribution over words. We can do this by calling the `get_topic_terms` of the lda model on the entire vocabulary.

You'll implement two functions below to calculate these.



In [10]:
def get_topic_mixture (corpus, lda_model):
  """ Get mixture of topics from the corpus.

  :params:
  :corpus (list): The transformed corpus where every document is a list
  :lda_model (gensim.models.ldamulticore.LdaMulticore): The gensim model

  :returns:
  topic_mixture (np.array): The corpus is represented as a numpy matrix, where
                            each row is a document and the columns correspond
                            to topics.

                            Size = number of documents x number of topics

  """
  topic_mixture = None
  # Your code here

  return topic_mixture

def get_word_dist (lda_model):
  """ Get the probability distribution of words for all the topics.

  :params:
  :lda_model (gensim.models.ldamulticore.LdaMulticore): The gensim model

  :returns:
  word_dist (np.array): The topics are represented as a numpy matrix, where
                        each row is a topic and the columns correspond to the
                        words.

                        Size = number of topics x number of words
  """
  word_dist = None
  # Your code here

  return word_dist

Now, let's call these functions to get the distributions.

In [11]:
topic_mix = get_topic_mixture (corpus, lda)
word_dist = get_word_dist (lda)

In fact, we'll just add all the topic probabilities to the dataframe

In [12]:
for i in range (lda.num_topics):
  df[f'topic_{i}'] = topic_mix[:,i]

You can see that we have added additional columns to the dataframe to have the distribution of topics for every movie.

In [None]:
df.head(3)

**Your turn!** Now let's find the correlation between the topics. We can do this by calculating the cosine distance between each pair of topics using the code from problem set 1 or you can use [this scikit-learn method](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html). Notice that every topic is a vectory of probabilities over words.

Give 3 topic pairs with the strongest correlation and 3 topic pairs with the weakest correlation? [0.5 points]

In [14]:
# Your code here to calculate the cosine distances


In [None]:
plt.imshow(dist, cmap='Greys', interpolation='nearest')

**Your turn!** Find the distribution of topics over time. Following are the steps that you want to follow, which you'll implement in each cell below [0.5 points]

We have the release date for each movie in the `release_date` column of the metadata frame.  We want to first filter the dataframe to exclude any missing values (NaN)

In [16]:
# Step 1 filter the metadata (Your code below)


- Next, you'll create two new columns for the year of release and decade of the release. Note that the release date is either the year directly or the full date string.

In [17]:
def extract_year (pub_date):
  """ Extract the year from the publication date string
  :params:
  pub_date (str): The publication date as a string

  :returns:
  year (int): The release year of the movie
  """
  pass
  # Your code below


def get_decade (year):
  """ Convert the year into the decade i.e 2017 --> 2010; 1992 -->1990

  :params:
  year (int): The year of release

  :returns:
  decade (int): The decade of the release
  """
  pass
  # Your code below


# Step 2 add the two columns
new_metadata["year"] = new_metadata["release_date"].apply (lambda x: extract_year (x))
new_metadata["decade"] = new_metadata["year"].apply (lambda x: get_decade (x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_metadata["year"] = new_metadata["release_date"].apply (lambda x: extract_year (x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_metadata["decade"] = new_metadata["year"].apply (lambda x: get_decade (x))


- Now join the dataframes that contain the text and the topics with the metadata by matching the field `wiki_id` from both the dataframes.

In [18]:
# Step 3 Join the two dataframes on the common column (Your code below)


- Finally, calculate the average probability per decade from 1900 to 2010. You may want to use pandas `groupby` function from the pandas library

In [None]:
decades = [1900,
           1910,
           1920,
           1930,
           1940,
           1950,
           1960,
           1970,
           1980,
           1990,
           2000,
           2010]

# Step 4: calculate the average probability by grouping (Your code below)


In [None]:
per_decade_plot.plot(x="decade", y=[f"topic_{i}" for i in range (10)], alpha=0.5)

**Sanity check!** I found the "war" topic to peak in the 1910's then fall during the great depression and rise again in the 50's after which it has remained quite steady. I'm not sure if it matches any hypothesis but it's interesting nonetheless!

## 2. Prediction [4 points]

In this section, we'll develop multiple regression models to predict the box office revenue of a movie. We'll start by first filtering the dataframe to contain only those movies which don't have any missing data.

In [21]:
regression_df = overall_df.dropna() # remove all the rows that contain any missing values
regression_df = regression_df.query ("decade in @decades")

Once you have the `regression_df` dataframe, we'll create another column called `tokenized_text` as follows

In [22]:
regression_df["tokenized_text"] = regression_df["tokens"].apply (lambda x:" ".join (x))

Let's see what the regression dataframe looks like

In [None]:
regression_df.head (5)

The code above simply joins the individual tokens from the plot summary in any movie into a string in which the tokens are separated by a whitespace

Next, we'll split the movies into a train and a test set as follows

In [23]:
train_wikiids, test_wikiids = train_test_split(regression_df.wiki_id,
                                               test_size=0.2,
                                               random_state=42)

We can create an array for our dependent variable in both the train and test sets

In [24]:
y_train = regression_df.query ('wiki_id in @train_wikiids').revenue.values
y_test = regression_df.query ('wiki_id in @test_wikiids').revenue.values

Finally, you'll need to use a tfidf vectorizer for some of the models which can be done in the following

In [191]:
vectorizer = TfidfVectorizer (sublinear_tf=True,
                              max_features=500,
                              max_df=0.5,
                              min_df=5)

We'll develop nested regression models from the following feature sets:

1. **Length of the movie (M)**: This feature is a scalar value that can be obtained directly by accessing the `runtime` field in the dataframe.

2. **Genres (G)**: Every movie is associated with a dictionary of genres, which can be accessed using the `genres` column in the dataframe. The key in this dictionary is a freebase id and the value is a genre name in plain english. To use genres as features, we'll need to convert the plain english names to a vector. Fortunately, we know how to do this tranformation -- by treating the different genres for a movie as a bag-of-genres vector. Thus, if there are K unique genres in total, the genres feature vector is of size K and the dimensions are either 0 or 1 indicating the presence or absence of a genre.

3. **Plot summaries (S)**: Every movie also contains a summary. We'll featurize the summary by creating a tfidf vector for each summary. The vocabulary size should be capped at 500.

4. **Topics (T)**: Every movie can be represented by a topic mixtures vector which we obtained in the previous section.

**Your turn!** You'll fill the following table for the different nested models with the root mean squared error between the predicted revenue for each model and the actual revenue. State the model that performs best [2 points]


|Models|RMSE|
|------|----|
|M||
|G||
|S||
|T||
|M+G||
|G+S||
|S+T||
|M+G+S||
|G+S+T||
|M+G+S+T||

**Notes:**

- Our regression model will use the features to predict the log of the revenue although the RMSE will be calculated by comparing the actual revenues.

- [RMSE](https://statisticsbyjim.com/regression/root-mean-square-error-rmse/) is a measure to evaluate the performance of a model by comparing the model predictions to the ground truth value. You can calculate RMSE by using [this function from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) setting the `squared` argument to `False`. A smaller RMSE value suggest a more accurate model

- The addition of models here means that the combined model has all the features from the individual models. For example, an M+D model means the features from the M model and the D model are concatenated.

- You may want to use [numpy's hstack](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html) method to concatenate the feature vectors

- You may also want to use the tfidf vectorizer we created earlier to get the tfidf features for each document [See this](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).



To build the above models, we'll implement the following functions.


**Your turn!** First, a function to train and test a regression model for the different features. By passing different feature matrices, we can reuse this function to train and test all our regression models [0.5 points]

In [212]:
def train_and_test (X_train, y_train, X_test, y_test):
  """ Trains a linear regression model, makes predictions with
  the trained model, and calculates the root mean squared error
  between the predicted values and the true values

  NOTE: Instead of regressing against the revenue, it might be better to predict
  the log of the revenue

  :params:
  :X_train (np.array): The predictors in the regression problem for the training data
  :y_train (np.array): The dependent variable in the regression problem for the training data
  :X_test (np.array): The predictors in the regression problem for the testing data
  :y_test (np.array): The dependent variable in the regression problem for the testing data

  :returns: the following triple as a tuple
  :lr (LinearRegression): The linear regression model obtained from sklearn
  :yhat (np.array): The array of predictions from the regression model on test data
  :err (float): The RMSE error between predicted values and test data outputs
  """

  lr, yhat, err = None, None, None
  # Your code below

  return lr, np.exp(yhat), err

**Your turn!** Next, we'll write a function that transforms the genres to their features [0.5 points]

**Hint!** Use sklearn's countvectorizer to extract the features

In [250]:
def make_genre_features (genres):
  """ Convert the plain english names of the genres from a dictionary
      to bag of words vector.

  :params:
  :genres (np.array): Every item in the array is a string representation of a
                      dictionary, which lists all the genres for the movie.
                      e.g. [{"/m/0lsxr": "Crime"},
                            {"/m/07s9rl0": "sci-fi", "/m/01jfsb":"thriller},
                            ...]

  :returns: pair of objects as tuple
  genre_mat (np.array): A binary matrix of size = num_movies X num_genres;
                        1 represents whether the movie corresponding to the row
                        is of the genre corresponding to the column


  genre2index (dict): A dictionary that maps the genre name to column number


  Note: To convert string representations of a dict to a dict, we can use
  the eval function
  """
  pass
  # Your code below

**Your turn!** We'll also write a utility function that helps preserve the train and test indices from the overall sequence of wikiids [0.5 points]


For example:
Suppose the sequence of wikiids was originally [1234, 5632, 756, 8354, 18792]. When we split into a train and test set, the train sequence turned out to be [8354, 1234, 756] and the test sequence as [18792, 5632]. Our function should give us an array of train indices as [3, 0, 2] and test indices as [4, 1].

In [214]:
def train_and_test_indices (all_ids, train_ids, test_ids):
  """ Find the indices of elements from train_ids and test_ids in the all_ids.

  e.g. all_ids = [1234, 5632, 756, 8354, 18792],
       train_ids = [8354, 1234, 756],
       test_ids = [18792, 5632]

       The output should be two lists:
       train_indices = [3,0,2],
       test_indices = [4,1]

  :params:
  :all_ids (np.array): All the wikipedia ids
  :train_ids (np.array): All the wikipedia ids part of the training set
  :test_ids (np.array): All the wikipedia ids part of the test set

  :returns: the following pair as a tuple
  :train_indices (np.array): The train indices for all the train ids
  :test_indices (np.array): The test indices for all the test ids
  """

  train_indices, test_indices = [], []
  # Your code below

  return train_indices, test_indices

Now let me show you how you can calculate the RMSE with the above functions for one such model -- predicting the revenue by the runtime of the movie. Long movies may not produce great box-office revenue but this predictor is unlikely to be great because there isn't much variability in the runtime of a movie.

In [215]:
# Predict movie revenue using movie runtime
X_train_runtime = np.reshape(regression_df.query ('wiki_id in @train_wikiids').runtime.values, (-1, 1))
X_test_runtime = np.reshape(regression_df.query ('wiki_id in @test_wikiids').runtime.values, (-1, 1))
lr_m, _, err_m = train_and_test (X_train_runtime, y_train, X_test_runtime, y_test)
print (f"RMSE: {err_m:.2f}")

RMSE: 128530237.60


**Your turn!** Fill all the cells below to get the RMSE for individual models [0.5 points]

In [None]:
# Predict the movie revenue using the genres of release.
# This is where you should call make_genre_features and train_test_indices

X_train_genres = None
X_test_genres = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_g, _, err_g = train_and_test (X_train_genres.toarray(), y_train, X_test_genres.toarray(), y_test)
print (f"RMSE: {err_g:.2f}")

In [None]:
# Predict the movie revenue using the tfidf vectors of text in plot summaries

X_train_summary = None
X_test_summary = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_s, _, err_s = train_and_test (X_train_summary.toarray(), y_train, X_test_summary.toarray(), y_test)
print (f"RMSE: {err_s:.2f}")

In [219]:
# Predict the movie revenue using the topic vectors for each movie

X_train_topics = None
X_test_topics = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_t, _, err_t = train_and_test (X_train_topics, y_train, X_test_topics, y_test)
print (f"RMSE: {err_t:.2f}")

RMSE: 119915377.33


In [None]:
# Predict the movie revenue using movie length and the genre features
# Note: When creating a combined feature matrix, you may want to convert the
# individual feature matrices as numpy arrays
X_train_runtime_genres = None
X_test_runtime_genres = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_m_g, _, err_m_g = train_and_test (X_train_runtime_genres,
                                     y_train,
                                     X_test_runtime_genres,
                                     y_test)
print (f"RMSE: {err_m_g:.2f}")

In [None]:
# Predict the movie revenue using movie length and the decade features
# Note: When creating a combined feature matrix, you may want to convert the
# individual feature matrices as numpy arrays
import scipy.sparse
X_train_genres_summary = None
X_test_genres_summary = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_g_s, _, err_g_s = train_and_test (scipy.sparse.csr_matrix (X_train_genres_summary),
                                     y_train,
                                     scipy.sparse.csr_matrix (X_test_genres_summary),
                                     y_test)
print (f"RMSE: {err_g_s:.2f}")

In [None]:
# Predict the movie revenue using movie length and the decade features
# Note: When creating a combined feature matrix, you may want to convert the
# individual feature matrices as numpy arrays
import scipy.sparse
X_train_summary_topics = None
X_test_summary_topics = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_s_t, _, err_s_t = train_and_test (scipy.sparse.csr_matrix (X_train_summary_topics),
                                     y_train,
                                     scipy.sparse.csr_matrix (X_test_summary_topics),
                                     y_test)
print (f"RMSE: {err_s_t:.2f}")

In [None]:
# Predict the movie revenue using movie length, genre, and plot summaries
# Note: When creating a combined feature matrix, you may want to convert the
# individual feature matrices as numpy arrays
import scipy.sparse
X_train_runtime_genres_summary = None
X_test_runtime_genres_summary = None

# Your code below to calculate the feature matrices

# Train and evaluate
lr_m_g_s, _, err_m_g_s = train_and_test (scipy.sparse.csr_matrix (X_train_runtime_genres_summary),
                                     y_train,
                                     scipy.sparse.csr_matrix (X_test_runtime_genres_summary),
                                     y_test)
print (f"RMSE: {err_m_g_s:.2f}")

In [None]:
# Predict the movie revenue using genre, plot summaries, and topics
# Note: When creating a combined feature matrix, you may want to convert the
# individual feature matrices as numpy arrays
import scipy.sparse
X_train_genres_summary_topics = None
X_test_genres_summary_topics = None

# Your code below

# Train and evaluate
lr_g_s_t, _, err_g_s_t = train_and_test (scipy.sparse.csr_matrix (X_train_genres_summary_topics),
                                     y_train,
                                     scipy.sparse.csr_matrix (X_test_genres_summary_topics),
                                     y_test)
print (f"RMSE: {err_g_s_t:.2f}")

In [None]:
# Predict the movie revenue using all the features
# Note: When creating a combined feature matrix, you may want to convert the
# individual feature matrices as numpy arrays
import scipy.sparse
X_train_all = None
X_test_all = None

# Your code below

# Train and evaluate
lr_all, _, err_all= train_and_test (scipy.sparse.csr_matrix (X_train_all),
                                     y_train,
                                     scipy.sparse.csr_matrix (X_test_all),
                                     y_test)
print (f"RMSE: {err_all:.2f}")

## Extra credit [1 point]

Try to reduce the RMSE of the best-performing model. Potential ways to reduce RMSE:

- Include more topics
- Include more genres
- Perform non-linear regression

Try one of the above or your own idea to reduce RMSE

## Confidence intervals over coefficients [2 points]

Our regression model gives us the coefficients associated with the features. In this section, we'll obtain the confidence intervals of these coefficients using bootstrapping. To obtain the confidence intervals around any coefficient, we would need to train the model on different samples obtained by sampling with replacement and calculate the CI empirically.

**Your turn!** Write a function to calculate the coefficients of a linear regression model on some training data a given number of times [0.5 points]

In [272]:
def bootstrapped_regression (X_train, y_train, num_bootstraps=1000):
  """ Fit linear models on bootstrapped samples

  :params:
  X_train (np.array): The train feature matrix
  y_train (np.array): The train output values
  num_bootstraps (int): 100

  :returns:
  coeffs (list): The coeffcients for all the features for every boostrap sample
  """

  coeffs = list ()
  # Your code below

  return coeffs

**Your turn!** Write a function to calculate the empirical confidence interval for a coefficient estimate [0.5 points]

In [269]:
def get_empirical_CI (values, lower=0.025, upper=.975):
  """ Calculate the empirical confidence interval from the given values

  :params:
  :values (np.array): Values of bootstrapped estimates
  :lower (float): The lower percentile (default:2.5)
  :upper (float): The upper percentile (default:97.5)

  :returns: pair as a tuple
  :lower_bound (float): The lower bound of the CI
  :upper_bound (float): The upper bound of the CI

  """
  # Your code below
  pass

**Your turn!** Calcualte the CI for coefficient estimates for the following

a. runtime [0.5 points]

b. genre corresponding to comedy [0.5 points]

What can you tell about the statistical significance of the estimate from the confidence interval for both the variables?

**Note:** Do 1000 bootstrapped intervals

In [273]:
coeffs = bootstrapped_regression (X_train_runtime, y_train, num_bootstraps=1000)
lb, ub = get_empirical_CI (np.array([item[0] for item in coeffs]))

100%|██████████| 1000/1000 [00:01<00:00, 806.03it/s]
