# Netflix Ratings: IMDB Score Predictors!

![Image of Netflix](https://help.nflxext.com/0af6ce3e-b27a-4722-a5f0-e32af4df3045_what_is_netflix_5_en.png)

# Looking at the 'Big Picture'...

### Frame the Problem

**What is the objective of this project?**
The objective for this project is to look at trends in the Netflix library of original content.

**How will the solution be used?**
The output of this project should be able to predict the IMDB Score of a particular Netflix title based on a variety of qualitative (Genre, Director, Language) and quantitative (Runtime, Days Since Premiere) features.

**How should the problem be framed?**
We should use a regression algorithm as we are looking to predict a continuous quantitative measure.

### Selecting a Performance Measure

**How should performance be measured?**
Performance will be measured using Root Mean Squared Error and Mean Absolute Error. 

$ RMSE(\textbf{X}, \textit{h}) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} ( h(\textbf{X}^{(i)}) - y^{(i)} )^{2} } $

$ MAE(\textbf{X}, \textit{h}) = \frac{1}{m} \sum_{i=1}^{m} | h(\textbf{X}^{(i)}) - y^{(i)} |  $
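As a quick sanity check on the two formulas, here is a minimal sketch that computes both metrics with NumPy. The arrays `y_true` and `y_pred` are made-up values for illustration, not taken from the Netflix data.

```python
import numpy as np

# Hypothetical scores and predictions, for illustration only
y_true = np.array([6.0, 7.0, 5.0, 8.0])
y_pred = np.array([6.5, 6.0, 5.0, 6.0])

errors = y_pred - y_true  # h(x) - y for each instance

rmse = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error
mae = np.mean(np.abs(errors))         # mean of the absolute errors

print("RMSE:", rmse)
print("MAE:", mae)
```

RMSE is never smaller than MAE on the same errors, and it penalizes large errors more heavily, which is why reporting both can be informative.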

**Is the performance measure aligned with business objectives?**
The performance measure should give *reasonable confidence* that the predictor produces the best possible estimates from the most impactful features.

### Other Discovery Questions

* What would be the minimum performance needed to reach the business objective?
* What are comparable problems? Can you reuse experience or tools?
* Is human expertise available?
* How would you solve the problem manually?
* List the assumptions made so far
* Verify assumptions if possible

Load initial packages and set parameters for matplotlib

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np #will need this to manipulate arrays
import pandas as pd #will need this to handle DataFrames and Series
import os #will help access file paths when retrieving data
import datetime #will help handle dates
from dateutil.relativedelta import relativedelta
from datetime import date

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

import seaborn as sns
from scipy import stats

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

# Get the Data

**List the data you need and how much you need**

We will need Netflix data at the grain of individual titles (movies, shows, programs).

**Find and document where you can get that data**

We can get the data from Kaggle. Here are some good candidates:

https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores *[Using this so far]*

https://www.kaggle.com/shivamb/netflix-shows *[TBD! Will use this in V2]*

**Other data questions to check**

Check how much space it will take.

Check legal obligations, and get authorization if necessary.

Get access authorizations.

Create workspace (with enough storage space).

**Get the data (for real!)**

Ok - now let's get the data. In this case I've saved this file to my local computer. Depending on the volume of data, it might be smarter to only extract some of this data.

In [None]:
# To list the files available in the Kaggle input directory, uncomment:
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# Path to the dataset inside the Kaggle input directory
NETFLIX_PATH = '../input/netflix-original-films-imdb-scores/NetflixOriginals.csv'


def load_netflixorigs_data(csv_path=NETFLIX_PATH):
    """Load the Netflix Originals CSV into a DataFrame."""
    return pd.read_csv(csv_path)

netflix_data = load_netflixorigs_data()


**Check how much space it will take**

In [None]:
netflix_data.shape

## Checking out the data structure

Ideally before jumping into this step, we'd have asked the experts about the data.

- Create a copy of the data for exploration.

- Create a Jupyter Notebook to keep a record of your data exploration.

- Study each attribute and its characteristics:
    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Possibly useful for the task
    - Types of distribution (Gaussian, uniform, logarithmic, etc.)

- For supervised learning tasks, identify the target attribute(s).
    - Visualize the data
    - Study the correlation between attributes.
    - Study how you would solve the problem manually.
    - Identify the promising transformations you may want to apply.
    - Identify extra data that would be useful.
    - Document what you learn.

**Prepare the Data**

Notes:

- Work on copies of the data (keep the original dataset intact)

- Write functions for all data transformations you apply, for five reasons:
    - So you can easily prepare the data the next time you get a fresh dataset
    - So you can apply these transformations in future projects.
    - To clean and prepare the test set.
    - To clean and prepare new data instances once your solution is live.
    - To make it easy to treat your preparation choices as hyperparameters.

1. Data cleaning:
    - Fix or remove outliers (optional).
    - Fill in missing values (e.g., with zero, mean, median,...) or drop their rows (or columns)

2. Feature selection (optional):
    - Drop the attributes that provide no useful information for the task.

3. Feature Engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.)
    - Aggregate features into promising new features.

4. Feature scaling: standardize or normalize features.
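To make the last step concrete, here is a minimal sketch contrasting standardization and min-max normalization. The `runtimes` array is hypothetical and only serves to show the effect of each scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical runtimes in minutes, as a single feature column
runtimes = np.array([[60.0], [90.0], [120.0]])

standardized = StandardScaler().fit_transform(runtimes)  # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(runtimes)      # rescaled to the range [0, 1]

print(standardized.ravel())
print(normalized.ravel())
```

Standardization is less affected by a fixed range, while min-max normalization guarantees bounded values; which one suits this dataset better is a choice worth treating as a hyperparameter.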

Check out the first few rows in the dataframe

In [None]:
netflix_data.head()

Look at the descriptive statistics of the dataframe

In [None]:
netflix_data.describe()

In [None]:
print('Mean of Runtime: {0}'.format(np.mean(netflix_data['Runtime'])))

The first thing we notice is that we only have two continuous measures (*Runtime* and *IMDB Score*), so we may need to create more features to work with.

In [None]:
netflix_data.info()

We are going to create another continuous feature that looks at the number of days between the Premiere date and today's date. We will create the 'days_since_premiere' column and add it back into our dataframe.

In [None]:
# Days elapsed between each title's premiere and today
premiere = pd.to_datetime(netflix_data["Premiere"])
today = pd.Timestamp(date.today())
netflix_data['days_since_premiere'] = (today - premiere) / pd.Timedelta(days=1)
netflix_data.head()

Let's now look at histograms of each numeric measure to see what the distributions look like. IMDB Score and Runtime both look roughly bell-shaped, while the days_since_premiere counts fall off steadily as the number of days grows, which makes sense if the library contains fewer older titles than recent ones.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
netflix_data.hist(bins=50, figsize=(20,15))
# save_fig("melb_attribute_histogram_plots")
plt.show()

In [None]:
# netflix_data["Runtime"].hist(bins=100)

fig, ax = plt.subplots(1,1,figsize=(10, 4))

# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(netflix_data['Runtime'], kde=True, ax=ax)

In [None]:
# Correlations are only defined for the numeric columns
netflix_data.select_dtypes(include=[np.number]).corr()

In [None]:
netflix_data.plot(kind="scatter", x="IMDB Score", y="Runtime", alpha=0.4, s=netflix_data["Runtime"]/10)

In [None]:
netflix_data.plot(kind="scatter", x="IMDB Score", y="days_since_premiere", alpha=0.4, s=netflix_data["Runtime"]/10)

# Create a Test Set

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(netflix_data, test_size=0.2, random_state=42)

In [None]:
print(len(train_set), "train +", len(test_set), "test")

In [None]:
test_set.head()
test_set.shape

In [None]:
train_set.head()
train_set.shape

In [None]:
netflix_data["Runtime"].hist()

In [None]:
netflix_data["runtime_cat"] = np.ceil(netflix_data["Runtime"] / 20)
# Cap the category at 7 so the sparsely populated long-runtime bins are merged
netflix_data["runtime_cat"] = netflix_data["runtime_cat"].where(netflix_data["runtime_cat"] < 7, 7)
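The binning used for `runtime_cat` can be traced on a handful of made-up runtimes. This is a minimal sketch; the `runtime` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical runtimes in minutes
runtime = pd.Series([15, 40, 95, 139, 209])

runtime_cat = np.ceil(runtime / 20)                  # 20-minute bins
runtime_cat = runtime_cat.where(runtime_cat < 7, 7)  # merge bin 7 and everything above it

print(runtime_cat.tolist())
```

Capping the category keeps every stratum large enough for a stratified split to sample from.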

In [None]:
netflix_data.hist(bins=50, figsize=(20,15))
# save_fig("melb_attribute_histogram_plots")
plt.show()

In [None]:
y = netflix_data["Runtime"]
y.head()

In [None]:
X = netflix_data.loc[:,['IMDB Score','Runtime']]
X.head()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1,test_size=0.10,random_state=42)

# netflix_data.iloc[0:,1]
for train_index, test_index in split.split(netflix_data, netflix_data["runtime_cat"]):
    strat_train_set = netflix_data.loc[train_index]
    strat_test_set = netflix_data.loc[test_index]

In [None]:
print(split)

In [None]:
type(split)

In [None]:
strat_test_set["runtime_cat"].value_counts() / len(strat_test_set)
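To see why the stratified test set mirrors the category proportions, here is a minimal sketch on synthetic data. The `toy` frame and its 80/20 category mix are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 80 rows in category 1 and 20 rows in category 2
toy = pd.DataFrame({"value": np.arange(100), "cat": [1] * 80 + [2] * 20})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["cat"]))

# Share of each category within the 10-row test set
test_props = toy.iloc[test_idx]["cat"].value_counts(normalize=True)
print(test_props)
```

The test set keeps the same 80/20 mix as the full data, which a purely random split of this size would not guarantee.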

In [None]:
train_set.plot(kind="scatter", x="IMDB Score", y="Runtime", alpha=0.4, s=train_set["Runtime"]/10)

In [None]:
test_set.plot(kind="scatter", x="IMDB Score", y="Runtime", alpha=0.4, s=test_set["Runtime"]/10)

In [None]:
def genre_proportions(data):
    return data["Genre"].value_counts() / len(data)

def language_proportions(data):
    return data["Language"].value_counts() / len(data)

train_set, test_set = train_test_split(netflix_data, test_size=0.2, random_state=42)

# compare_props = pd.DataFrame({
#     "Overall": genre_proportions(netflix_data),
#     "Stratified": genre_proportions(strat_test_set),
#     "Random": genre_proportions(test_set),
# }).sort_index()

compare_props = pd.DataFrame({
    "Overall": language_proportions(netflix_data),
    "Stratified": language_proportions(strat_test_set),
    "Random": language_proportions(test_set),
}).sort_index()

compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [None]:
compare_props.sort_values(by="Overall", ascending=False)

# Discover and visualize the data to gain insights

In [None]:
netflix = strat_train_set.copy()  # explore the training set only, so the test set stays unseen
netflix.head()

In [None]:
netflix.plot(kind="scatter", x="Runtime", y="IMDB Score")

In [None]:
netflix.plot(kind="scatter", x="Runtime", y="IMDB Score", alpha=0.5)

In [None]:
netflix.plot(kind="scatter", x="Runtime", y="IMDB Score", alpha=0.5,
    s=netflix["IMDB Score"], label="IMDB Score", figsize=(10,7),
    c="IMDB Score", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()

In [None]:
corr_matrix = netflix.select_dtypes(include=[np.number]).corr()  # numeric columns only

In [None]:
corr_matrix["IMDB Score"].sort_values(ascending=False)

In [None]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

attributes = ["IMDB Score", "runtime_cat", "Runtime","days_since_premiere"]
scatter_matrix(netflix[attributes], figsize=(12, 8))
# save_fig("scatter_matrix_plot")

In [None]:
netflix.plot(kind="scatter", x="days_since_premiere", y="Runtime",
             alpha=0.1)
plt.axis([0, 3000, 0, 200])

# Prepare the data for Machine Learning Algorithms

In [None]:
# Note: this cell treats Runtime as the label, so IMDB Score remains among the predictors
netflix = strat_train_set.drop("Runtime", axis=1) # drop labels for training set
netflix_labels = strat_train_set["Runtime"].copy()

In [None]:
sample_incomplete_rows = netflix[netflix.isnull().any(axis=1)].head()
sample_incomplete_rows

In [None]:
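median = netflix["IMDB Score"].median()
sample_incomplete_rows = sample_incomplete_rows.copy()  # avoid modifying a slice of netflix
sample_incomplete_rows["IMDB Score"] = sample_incomplete_rows["IMDB Score"].fillna(median)  # option 3: fill with the median
sample_incomplete_rows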

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

Remove the text attribute because median can only be calculated on numerical attributes:

In [None]:
# Keep only the numeric columns
netflix_num = netflix.select_dtypes(include=[np.number])
netflix_num.head()
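The interplay between `select_dtypes` and the median imputer can be sketched on a tiny frame. The `toy` DataFrame below is hypothetical; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one text column and some missing numeric values
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", "Drama", "Thriller"],
    "Runtime": [90.0, np.nan, 110.0, 100.0],
    "IMDB Score": [6.0, 7.0, np.nan, 8.0],
})

toy_num = toy.select_dtypes(include=[np.number])  # drops the text column
imputer = SimpleImputer(strategy="median").fit(toy_num)

# transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(imputer.transform(toy_num), columns=toy_num.columns)
print(imputer.statistics_)
print(filled)
```

The learned medians live in `statistics_`, so the same values can later be applied to the test set and to new data.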

In [None]:
imputer.fit(netflix_num)

In [None]:
imputer.statistics_

In [None]:
netflix_num.median().values

In [None]:
X = imputer.transform(netflix_num)

In [None]:
netflix_tr = pd.DataFrame(X, columns=netflix_num.columns,
                          index = list(netflix.index.values))

In [None]:
netflix_tr.loc[sample_incomplete_rows.index.values]

In [None]:
imputer.strategy

In [None]:
netflix_tr = pd.DataFrame(X, columns=netflix_num.columns)
netflix_tr.head()

Now let's preprocess the categorical input feature, Genre:

In [None]:
# Select the categorical column to encode
netflix_cat = netflix['Genre']
netflix_cat.head()

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
netflix_cat.describe()

In [None]:
ordinal_encoder = OrdinalEncoder()
netflix_cat_encoded = ordinal_encoder.fit_transform(netflix_cat.values.reshape(-1, 1))  # one row per title, one column
netflix_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_
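`OrdinalEncoder` expects a 2D input with one row per sample and one column per feature. The following is a minimal sketch on a made-up genre column; the `genres` frame is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical genre column
genres = pd.DataFrame({"Genre": ["Drama", "Comedy", "Drama", "Thriller"]})

encoder = OrdinalEncoder()
# Double brackets keep a 2D (n_samples, 1) shape;
# a Series would need .values.reshape(-1, 1) instead
encoded = encoder.fit_transform(genres[["Genre"]])

print(encoded.ravel())      # one integer code per row
print(encoder.categories_)  # categories in sorted order
```

Note that the codes imply an ordering between genres that has no real meaning, which is one reason to prefer one-hot encoding for this feature.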

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
netflix_cat_1hot = cat_encoder.fit_transform(netflix_cat.values.reshape(-1, 1))  # one row per title, one column
netflix_cat_1hot

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray() method:

In [None]:
netflix_cat_1hot.toarray()
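The difference between the sparse and dense forms is easier to see on a small example. This sketch uses a hypothetical `genres` frame and compares how many values each representation has to store.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical genre column
genres = pd.DataFrame({"Genre": ["Drama", "Comedy", "Drama", "Thriller"]})

one_hot = OneHotEncoder().fit_transform(genres)  # sparse matrix by default
dense = one_hot.toarray()

print(sparse.issparse(one_hot))
print("stored values (sparse):", one_hot.nnz)
print("stored values (dense):", dense.size)
```

With many distinct genres most entries are zero, so the sparse form stores only one value per row while the dense form stores one per row and category.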

Alternatively, you can set `sparse_output=False` when creating the `OneHotEncoder` (the parameter was named `sparse` before scikit-learn 1.2):

In [None]:
cat_encoder = OneHotEncoder(sparse_output=False)
netflix_cat_1hot = cat_encoder.fit_transform(netflix_cat.values.reshape(-1, 1))
netflix_cat_1hot

In [None]:
cat_encoder.categories_

Let's create a custom transformer to add extra attributes. The numeric attributes of interest so far are `IMDB Score`, `runtime_cat`, `Runtime`, and `days_since_premiere`.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Add days_since_premiere to a DataFrame that has a 'Premiere' column."""
    def __init__(self, add_days_since_premiere=True):  # no *args or **kwargs
        self.add_days_since_premiere = add_days_since_premiere
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        X = X.copy()  # leave the caller's DataFrame untouched
        if self.add_days_since_premiere:
            premiere = pd.to_datetime(X["Premiere"])
            today = pd.Timestamp(date.today())
            X["days_since_premiere"] = (today - premiere) / pd.Timedelta(days=1)
        return X

attr_adder = CombinedAttributesAdder(add_days_since_premiere=True)
netflix_extra_attribs = attr_adder.transform(netflix)
netflix_extra_attribs.head()

Now let's build a pipeline for preprocessing the numerical attributes:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
#         ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

netflix_num_tr = num_pipeline.fit_transform(netflix_num)

In [None]:
netflix_num_tr

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Now let's join all these components into a big pipeline that will preprocess both the numerical and the categorical features:

In [None]:
num_attribs = list(netflix_num)
cat_attribs = ["Genre"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
#         ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # sparse_output was named `sparse` before scikit-learn 1.2
        ('cat_encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
    ])

In [None]:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
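As an alternative to the `DataFrameSelector` plus `FeatureUnion` combination, scikit-learn's `ColumnTransformer` can route columns to different preprocessing steps directly. The following is a minimal sketch of that swapped-in approach, not the pipeline used in this notebook; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the training frame
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", "Drama", "Thriller"],
    "Runtime": [90.0, np.nan, 110.0, 100.0],
    "days_since_premiere": [400.0, 800.0, 1200.0, 1600.0],
})

# Numeric columns: impute missing values, then standardize
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry names a transformer and the columns it receives
preprocess = ColumnTransformer([
    ("num", numeric, ["Runtime", "days_since_premiere"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"]),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # 2 scaled numeric columns plus 3 one-hot genre columns
```

Because the columns are selected by name, no custom selector class is needed and the DataFrame can be passed in as-is.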

In [None]:
netflix_prepared = full_pipeline.fit_transform(netflix)
netflix_prepared

In [None]:
netflix_prepared.shape

In [None]:
netflix_labels.shape

# Select and train a model

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(netflix_prepared, netflix_labels)

In [None]:
# let's try the full pipeline on a few training instances
some_data = netflix.iloc[:5]
some_labels = netflix_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))

Compare against the actual values:

In [None]:
print("Labels:", list(some_labels))

In [None]:
some_data_prepared

In [None]:
from sklearn.metrics import mean_squared_error

netflix_predictions = lin_reg.predict(netflix_prepared)
lin_mse = mean_squared_error(netflix_labels, netflix_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(netflix_labels, netflix_predictions)
lin_mae

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(netflix_prepared, netflix_labels)

In [None]:
netflix_predictions = tree_reg.predict(netflix_prepared)
tree_mse = mean_squared_error(netflix_labels, netflix_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

# Fine-tune the model

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, netflix_prepared, netflix_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
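The sign flip before the square root is needed because scikit-learn's cross-validation scorers follow a "greater is better" convention, so the MSE is reported as a negative number. Here is a minimal sketch on synthetic data generated with `make_regression`; none of it comes from the Netflix dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)  # flip the sign before taking the root

print(scores)       # negative MSE, one value per fold
print(rmse_scores)  # RMSE, one value per fold
```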

In [None]:
lin_scores = cross_val_score(lin_reg, netflix_prepared, netflix_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(netflix_prepared, netflix_labels)

In [None]:
netflix_predictions = forest_reg.predict(netflix_prepared)
forest_mse = mean_squared_error(netflix_labels, netflix_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, netflix_prepared, netflix_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

In [None]:
scores = cross_val_score(lin_reg, netflix_prepared, netflix_labels, scoring="neg_mean_squared_error", cv=10)
pd.Series(np.sqrt(-scores)).describe()

In [None]:
from sklearn.svm import SVR

svm_reg_rbf = SVR(kernel="rbf")
svm_reg_rbf.fit(netflix_prepared, netflix_labels)
netflix_predictions = svm_reg_rbf.predict(netflix_prepared)
svm_mse = mean_squared_error(netflix_labels, netflix_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse

In [None]:
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(netflix_prepared, netflix_labels)
netflix_predictions = svm_reg.predict(netflix_prepared)  # predictions from the linear-kernel model
svm_mse = mean_squared_error(netflix_labels, netflix_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(netflix_prepared, netflix_labels)
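The count of training rounds quoted in the comment can be checked with `ParameterGrid`, which expands a grid the same way `GridSearchCV` does. This is a small sketch that repeats the grid from the cell above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_combinations = len(ParameterGrid(param_grid))  # 12 + 6 combinations
n_fits = n_combinations * 5                      # one fit per combination per fold

print(n_combinations, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole training set.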

The best hyperparameter combination found:

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

Let's look at the score of each hyperparameter combination tested during the grid search:

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
pd.DataFrame(grid_search.cv_results_)

In [None]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("Runtime", axis=1)
y_test = strat_test_set["Runtime"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [None]:
final_rmse

In [None]:
from scipy import stats

In [None]:
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)

np.sqrt(stats.t.interval(confidence, m - 1,
                         loc=np.mean(squared_errors),
                         scale=stats.sem(squared_errors)))

In [None]:
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)

Alternatively, we could use z-scores rather than t-scores:

In [None]:
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)
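The two intervals differ only through the critical value. A minimal sketch comparing them, assuming a hypothetical test-set size `m`:

```python
from scipy import stats

confidence = 0.95
m = 59  # hypothetical test-set size, for illustration only

t_crit = stats.t.ppf((1 + confidence) / 2, df=m - 1)  # Student's t critical value
z_crit = stats.norm.ppf((1 + confidence) / 2)         # standard normal critical value

print("t:", t_crit)
print("z:", z_crit)
```

The t critical value is slightly larger, so the t-based interval is a little wider; the gap shrinks as the test set grows.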

#  ----