In [None]:
# Generating a synthetic dataset that can be used to train machine learning models

"""
For binary classification.

The dataset will have the following specifications:

Task: Binary Classification
Features: 10 numerical features, generated based on a mix of distributions to simulate real-world data complexity
Samples: 1000 data points
Labels: 0 or 1, with approximately equal distribution
"""

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Settings
n_samples = 1000
n_features = 10
n_classes = 2

# Generate synthetic dataset
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_features - 2,  # the remaining 2 features are pure noise
    n_redundant=0,
    n_classes=n_classes,
    flip_y=0.01,  # assign a random label to about 1% of the samples
    class_sep=1.5,
    random_state=42,
)

# Convert to pandas DataFrame for ease of use
df = pd.DataFrame(X, columns=[f'feature_{i+1}' for i in range(n_features)])
df['label'] = y

# Save dataset to CSV file (optional)
df.to_csv('synthetic_classification_dataset.csv', index=False)

# Print the first few rows to check
print(df.head())


In [None]:
"""
For regression problems.

Let's generate a synthetic dataset with the following specifications:

Task: Regression
Features: 10 numerical features, generated to simulate real-world scenarios with a mix of linear and non-linear relationships.
Samples: 1000 data points
Target: A continuous value derived from the features plus some noise to simulate real-world data variance.
"""
import numpy as np
import pandas as pd

# Settings
n_samples = 1000
n_features = 10
noise_level = 0.1

# Generate synthetic features
np.random.seed(42)
X = np.random.rand(n_samples, n_features)

# Generate a target variable with a mix of linear and non-linear relationships
# Example: target = 3*feature_1 + 2*np.sin(feature_2) - log(1 + feature_3) + noise
noise = np.random.normal(0, noise_level, n_samples)
target = 3*X[:, 0] + 2*np.sin(X[:, 1]) - np.log(1 + np.abs(X[:, 2])) + noise

# Convert to pandas DataFrame for ease of use
features_df = pd.DataFrame(X, columns=[f'feature_{i+1}' for i in range(n_features)])
target_df = pd.DataFrame(target, columns=['target'])

# Combine features and target into one DataFrame
df = pd.concat([features_df, target_df], axis=1)

# Save dataset to CSV file (optional)
df.to_csv('synthetic_regression_dataset.csv', index=False)

# Print the first few rows to check
print(df.head())


# Ensemble

Ensemble methods are techniques in machine learning that combine the predictions from multiple models to improve robustness, reduce variance, and increase accuracy compared to any single model used within the ensemble. These methods are particularly useful for complex prediction tasks where it's challenging to achieve high performance with a single model. Let's delve into some of the most common ensemble methods: bagging, boosting, stacking, and blending, along with an overview of others.

### 1. Bagging (Bootstrap Aggregating)
- **Principle:** Bagging involves training multiple models of the same type on different subsets of the training data. These subsets are created by randomly sampling the original dataset with replacement (bootstrap samples). The final prediction is typically the average of the predictions (for regression tasks) or the majority vote (for classification tasks).
- **Example:** Random Forest is a popular bagging-based ensemble method that uses multiple decision trees.
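
The bootstrap-and-vote idea can be sketched without any library. This is a minimal illustration, not how Random Forest is implemented: the one-feature toy data, the `fit_stump` base learner, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Toy base learner: threshold halfway between the two class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:
        # Degenerate bootstrap sample: only one class was drawn
        only = 0 if xs0 else 1
        return lambda x: only
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    if mean1 >= mean0:
        return lambda x: int(x > threshold)
    return lambda x: int(x < threshold)

def bagging_predict(models, x):
    """Majority vote over the predictions of all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) pairs
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

rng = random.Random(42)
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))
```

Each stump sees a slightly different resample, so the individual thresholds differ; the vote smooths out that variation.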

### 2. Boosting
- **Principle:** Boosting is a sequential process in which each model attempts to correct the errors of the previous models. In AdaBoost, the data points that earlier models misclassified or predicted with higher error are given more weight, so subsequent models focus on the difficult cases; gradient boosting instead fits each new model to the residual errors of the current ensemble. This process continues until a specified number of models has been created or no further improvement can be made.
- **Examples:** AdaBoost (Adaptive Boosting) and Gradient Boosting are well-known boosting methods. XGBoost, LightGBM, and CatBoost are advanced implementations that are highly popular in machine learning competitions due to their performance and speed.
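
To make the reweighting step concrete, here is a small sketch of a single round of the classic discrete AdaBoost update for labels in {-1, +1}. The function name `adaboost_round` and the toy labels are assumptions for illustration; library implementations differ in detail.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete-AdaBoost step: model weight (alpha) and new sample weights."""
    # Weighted error of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified points (t * p == -1) are scaled up, correct ones down
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the last point is misclassified
weights = [0.2] * 5           # uniform starting weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After one round the single misclassified point carries half of the total weight, which is what pushes the next weak learner toward it.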

### 3. Stacking (Stacked Generalization)
- **Principle:** Stacking trains a new model (a meta-learner, sometimes called a blender) to combine the predictions of several different base models. The base models' out-of-fold predictions, produced through cross-validation so that no model predicts on data it was trained on, become the input features of the meta-learner, which produces the final prediction. The base models are then refit on the full training set for use at prediction time. This lets the ensemble draw on the strengths of each base model and can reduce both bias and variance.
- **Example:** The base level can include diverse models like decision trees, SVMs, and neural networks, while the meta-learner could be a logistic regression model.
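
The meta-learner's training inputs are usually built from out-of-fold predictions. The sketch below shows that mechanism in plain Python on a toy regression problem; `kfold_indices`, `fit_mean`, `fit_slope`, and the data are all hypothetical names and values chosen for the example.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean(xs, ys):
    """Toy base model 1: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_slope(xs, ys):
    """Toy base model 2: least-squares line through the origin."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x

def out_of_fold_predictions(fit, xs, ys, k):
    """Predict each point with a model that never saw it during training."""
    oof = [None] * len(xs)
    for fold in kfold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        for i in fold:
            oof[i] = model(xs[i])
    return oof

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# One column per base model; these rows are the meta-learner's training inputs
meta_features = list(zip(
    out_of_fold_predictions(fit_mean, xs, ys, k=3),
    out_of_fold_predictions(fit_slope, xs, ys, k=3),
))
print(meta_features)
```

A meta-learner would then be fit on `meta_features` against `ys`.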

### 4. Blending
- **Principle:** Blending is similar to stacking but with a slight difference in how the training set for the meta-learner is created. Instead of using out-of-fold predictions from the base models (as in stacking), blending uses a holdout set (a validation set that is not part of the cross-validation used to train the base models) to train the meta-learner.
- **Example:** If the dataset is split into 80% training and 20% test data, the training portion might be split further, so that 70% of the full dataset trains the base models and the remaining 10% serves as the holdout set for the blender.
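
The split described in the example can be written as index arithmetic. This is only a sketch: the function name `blending_split`, the default fractions, and the seed are assumptions, and the fractions refer to the full dataset.

```python
import random

def blending_split(n, base_frac=0.7, holdout_frac=0.1, seed=42):
    """Shuffle indices 0..n-1 and cut them into base / holdout / test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_base = int(round(n * base_frac))
    n_holdout = int(round(n * holdout_frac))
    base = idx[:n_base]                         # trains the base models
    holdout = idx[n_base:n_base + n_holdout]    # trains the blender
    test = idx[n_base + n_holdout:]             # final evaluation only
    return base, holdout, test

base_idx, holdout_idx, test_idx = blending_split(1000)
print(len(base_idx), len(holdout_idx), len(test_idx))
```

The base models predict on the holdout rows, and those predictions (with the holdout labels) form the blender's training set.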

### Other Ensemble Methods
- **Voting:** Simplest form of ensemble, where the predictions from multiple models are combined through a majority vote (for classification) or average (for regression).
- **Snapshot Ensembling:** Involves saving snapshots of a single model's weights at several points during one training run, typically at the end of each cycle of a cyclical learning-rate schedule. The predictions of these snapshots are then averaged to make the final prediction.
- **Model Averaging:** A straightforward approach where the final prediction is the average of the predictions from multiple models. It's a simple but often effective method, especially when the models are diverse.
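
Majority voting and averaging can disagree on the same inputs. The short sketch below contrasts the two for a binary classifier; the function names and the probabilities are made up for illustration.

```python
from collections import Counter

def hard_vote(probas, threshold=0.5):
    """Each model casts one vote for its own predicted class."""
    labels = [int(p >= threshold) for p in probas]
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, threshold=0.5):
    """Average the predicted probabilities, then threshold once."""
    return int(sum(probas) / len(probas) >= threshold)

# Hypothetical class-1 probabilities from three models for one sample
probas = [0.45, 0.45, 0.95]

print(hard_vote(probas))  # two of three models vote for class 0
print(soft_vote(probas))  # the mean probability is above 0.5
```

Averaging lets one very confident model outweigh two mildly uncertain ones, which majority voting cannot do.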

Ensemble methods can significantly improve prediction performance by combining the strengths and reducing the weaknesses of individual models. However, they also tend to increase computational complexity and training time. It's essential to balance performance improvements with computational costs, especially in real-world applications where resources might be limited.

[sklearn](https://scikit-learn.org/stable/user_guide.html)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
# Pie plot for target

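colors = ['gold', 'mediumturquoise']
# Sort by class value so each share lines up with its label
# (value_counts orders by frequency, not by class)
values = df['Outcome'].value_counts(normalize=True).sort_index()
labels = values.index.astype(str).tolist()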

# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(
    title_text="Outcome")
fig.show()


In [None]:
feature_names = list(df.loc[:, :'Age'].columns)

# Histogram of each feature, split by outcome
rcParams['figure.figsize'] = 40, 60
sns.set(font_scale=3)
sns.set_style("white")
sns.set_palette("bright")
plt.subplots_adjust(hspace=0.5)
for i, name in enumerate(feature_names, start=1):
    plt.subplot(5, 2, i)
    sns.histplot(data=df, x=name, hue="Outcome", kde=True, palette="YlGnBu")

In [None]:
# Histogram 2

# 'bin_edges' is an array holding the boundaries of the histogram bins
count, bin_edges = np.histogram(df['2013'])

df['2013'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)

plt.title('Histogram of Immigration from 195 countries in 2013') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label
plt.show()

In [None]:
# Pair plot
sns.set(font_scale=2)
sns.set_style("white")
sns.set_palette("bright")
# pairplot creates its own figure, so a separate plt.figure() call is not needed;
# a palette only takes effect when `hue` is set
sns.pairplot(df, kind='reg', corner=True)

In [None]:
# Histogram with a marginal box plot
fig = px.histogram(df, x="Glucose", 
                   color="Outcome", 
                   marginal="box",
                   barmode ="overlay",
                   histnorm ='density'
                  )  
fig.update_layout(
    title_font_color="black",
    legend_title_font_color="green",
    title={
        'text': "Glucose Histogram per Outcome",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
)
fig.show()

In [None]:
# testing relevant feature
from sklearn.feature_selection import SelectKBest, f_classif

# Assuming 'X_transformed' is the output from your preprocessor and 'y_train' is your target variable
# Note: Adjust this example if you're using the pipeline approach to directly fit on raw data

# Fit SelectKBest; fit_transform returns the selected feature matrix
# (fit alone returns the selector object itself)
selector = SelectKBest(score_func=f_classif, k='all')  # 'all' to keep all features for demonstration
X_new = selector.fit_transform(X_transformed, y_train)

# Get scores and p-values
scores = selector.scores_
p_values = selector.pvalues_

# Otherwise, if you're directly applying SelectKBest after manual preprocessing:
feature_names = ['age', 'pregnancies', 'bmi', 'skinthickness', 'insulin', 'glucose',
       'bloodpressure', 'diabetespedigreefunction']  # Fill this with your actual feature names

# Create a DataFrame to display scores and p-values
import pandas as pd
feature_scores = pd.DataFrame({
    'Feature': feature_names,
    'ANOVA F-Score': scores,
    'p-value': p_values
}).sort_values(by='ANOVA F-Score', ascending=False)

print(feature_scores)
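
The score reported by `f_classif` is a one-way ANOVA F-statistic computed per feature. As a rough sketch of what that number measures, the plain-Python function below computes it for a single feature; `anova_f` and the toy values are illustrative and not part of scikit-learn.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Variation of the class means around the overall mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of the samples around their own class mean
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical feature values for two classes
f_score = anova_f([1, 2, 3, 2, 3, 4], [0, 0, 0, 1, 1, 1])
print(f_score)
```

A larger value means the class means are far apart relative to the spread inside each class, which is why high-scoring features are considered more relevant.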

In [None]:
# Correlation plot
corr = df.corr().round(2)

sns.set_theme(font_scale=1.15)
plt.figure(figsize=(14, 10))
sns.set_palette("bright")
sns.set_style("white")
# Mask the upper triangle (including the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, cmap='gist_yarg_r', mask=mask, cbar=True)
plt.title('Correlation Plot')

In [None]:
# Change the decision threshold and calculate different validation scores

def calculator(y_test, y_predict):
    raise NotImplementedError("threshold-based scoring is not implemented yet")
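
One possible shape for such a helper is sketched below. It is an assumption about the intent, not the notebook's own implementation: it takes predicted probabilities rather than hard predictions, and the name `scores_at_threshold` and the sample data are hypothetical.

```python
def scores_at_threshold(y_true, y_proba, threshold=0.5):
    """Binarize probabilities at `threshold` and compute basic metrics."""
    y_pred = [int(p >= threshold) for p in y_proba]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical labels and predicted class-1 probabilities
y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.6, 0.4, 0.7, 0.9, 0.3]

for threshold in (0.35, 0.5):
    print(threshold, scores_at_threshold(y_true, y_proba, threshold))
```

Lowering the threshold here raises recall, since more samples are labeled positive; what happens to precision depends on the data.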

In [2]:
#Optimizing final estimator

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import StackingClassifier

# Load some example data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Define base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=10, random_state=42)),
]

# Define meta-model
meta_model = LogisticRegression()

# Construct stacking model
stack_model = StackingClassifier(estimators=base_models, final_estimator=meta_model)

# Prepare parameter grid, targeting meta-model
param_grid = {
    'final_estimator__C': [0.1, 1.0, 10.0, 100.0],
    'final_estimator__solver': ['liblinear', 'lbfgs']
}

# Instantiate and run GridSearchCV
grid_search = GridSearchCV(estimator=stack_model, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)



In [None]:
# function to store evaluation and plot them

# Aggregation 

In [None]:
# Aggregate temperature by day
daily_data_df = data_df \
    .groupby(['date', 'year', 'month', 'day', 'dayofyear'], as_index=False)\
    .agg({'temperature': 'mean'}) \
    .set_index('date')

# Waffle Charts

In [None]:
#install pywaffle
#!pip install pywaffle
from pywaffle import Waffle

#Set up the Waffle chart figure

fig = plt.figure(FigureClass = Waffle,
                 rows = 20, columns = 30, #pass the number of rows and columns for the waffle 
                 values = df_dsn['Total'], #pass the data to be used for display
                 cmap_name = 'tab20', #color scheme
                 legend = {'labels': [f"{k} ({v})" for k, v in zip(df_dsn.index.values,df_dsn.Total)],
                            'loc': 'lower left', 'bbox_to_anchor':(0,-0.1),'ncol': 3}
                 #notice the use of list comprehension for creating labels 
                 #from index and total of the dataset
                )

#Display the waffle chart
plt.show()

In [None]:
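# Brute-force way: build the waffle chart manually
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches  # needed for the legend patches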

def create_waffle_chart(categories, values, height, width, colormap, value_sign=''):

    # compute the proportion of each category with respect to the total
    total_values = sum(values)
    category_proportions = [(float(value) / total_values) for value in values]

    # compute the total number of tiles
    total_num_tiles = width * height
    print('Total number of tiles is', total_num_tiles)

    # compute the number of tiles for each category
    tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]

    # print out number of tiles per category (use the argument, not a global)
    for i, tiles in enumerate(tiles_per_category):
        print(str(categories[i]) + ': ' + str(tiles))
    
    # initialize the waffle chart as an empty matrix
    waffle_chart = np.zeros((height, width))

    # define indices to loop through waffle chart
    category_index = 0
    tile_index = 0

    # populate the waffle chart
    for col in range(width):
        for row in range(height):
            tile_index += 1

            # if the number of tiles populated for the current category 
            # is equal to its corresponding allocated tiles...
            if tile_index > sum(tiles_per_category[0:category_index]):
                # ...proceed to the next category
                category_index += 1       
            
            # set the class value to an integer, which increases with class
            waffle_chart[row, col] = category_index
    
    # use matshow to display the waffle chart; matshow opens its own figure,
    # and the colormap passed in by the caller is used as-is
    plt.matshow(waffle_chart, cmap=colormap)
    plt.colorbar()

    # get the axis
    ax = plt.gca()

    # set minor ticks
    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)
    
    # add gridlines based on minor ticks
    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

    plt.xticks([])
    plt.yticks([])

    # compute cumulative sum of individual categories to match color schemes between chart and legend
    values_cumsum = np.cumsum(values)
    total_values = values_cumsum[len(values_cumsum) - 1]

    # create legend
    legend_handles = []
    for i, category in enumerate(categories):
        if value_sign == '%':
            label_str = category + ' (' + str(values[i]) + value_sign + ')'
        else:
            label_str = category + ' (' + value_sign + str(values[i]) + ')'
            
        color_val = colormap(float(values_cumsum[i])/total_values)
        legend_handles.append(mpatches.Patch(color=color_val, label=label_str))

    # add legend to chart
    plt.legend(
        handles=legend_handles,
        loc='lower center', 
        ncol=len(categories),
        bbox_to_anchor=(0., -0.2, 0.95, .1)
    )
    plt.show()

width = 40 # width of chart
height = 10 # height of chart

categories = df_dsn.index.values # categories
values = df_dsn['Total'] # corresponding values of categories

colormap = plt.cm.coolwarm # color map class

create_waffle_chart(categories, values, height, width, colormap)
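
One caveat with the tile computation above: rounding each category independently does not guarantee that the counts add up to `width * height`. A possible remedy, sketched here with a hypothetical helper `allocate_tiles`, is the largest-remainder method: floor every share, then hand the leftover tiles to the categories with the largest fractional parts.

```python
import math

def allocate_tiles(values, total_tiles):
    """Distribute total_tiles proportionally so the counts sum exactly."""
    total = sum(values)
    exact = [v / total * total_tiles for v in values]
    tiles = [math.floor(e) for e in exact]
    leftover = total_tiles - sum(tiles)
    # Categories with the largest fractional part get the remaining tiles
    order = sorted(range(len(values)),
                   key=lambda i: exact[i] - tiles[i], reverse=True)
    for i in order[:leftover]:
        tiles[i] += 1
    return tiles

# Hypothetical totals: three equal categories on a 40 x 10 grid
values_demo = [1, 1, 1]
naive = [round(v / sum(values_demo) * 400) for v in values_demo]
exact_fit = allocate_tiles(values_demo, 400)

print(sum(naive), sum(exact_fit))
```

With the naive rounding, one tile of the grid would be left unassigned in this case.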