<a href="https://colab.research.google.com/github/tyoungg/Colab_stuff/blob/main/random_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Task: Predict each of these variables independently: MAJOR_CASH, MAJOR_PLEDGE, COMMIT_MAJOR, INFLATION_MAJOR_COMMIT. Some of the data may also require conversion from qualitative variables. By independent I mean that none of MAJOR_CASH, MAJOR_PLEDGE, COMMIT_MAJOR, INFLATION_MAJOR_COMMIT should be treated as part of the dataset used to predict the others

Here is all the data you need:
/tmp/Complete_Randomized_Dataset.csv

## 1) data_loading

### Subtask:
Load the data from the provided CSV file into a pandas DataFrame.

## Plan

1. **data_loading**: Load the data from the CSV file into a pandas DataFrame using `pd.read_csv()`.
2. **data_exploration**:
    - Use `df.info()` and `df.describe()` to understand the data types and distributions of each variable, including `MAJOR_CASH`, `MAJOR_PLEDGE`, `COMMIT_MAJOR`, and `INFLATION_MAJOR_COMMIT`.
    - Identify qualitative variables that need conversion by checking their data types.
3. **data_preparation**: Create four separate copies of the DataFrame using `df.copy()` for predicting each target variable (`MAJOR_CASH`, `MAJOR_PLEDGE`, `COMMIT_MAJOR`, `INFLATION_MAJOR_COMMIT`).
4. **data_wrangling**:
    - In each DataFrame copy, drop the other three target variables using `df.drop(columns=['column_name'])`.
    - Convert qualitative variables into numerical representations using appropriate encoding techniques (e.g., `pd.get_dummies()` for one-hot encoding) within each DataFrame copy.
5. **feature_engineering**:
    - For each DataFrame copy, select relevant features for prediction. This may involve dropping irrelevant columns using `df.drop(columns=['column_name'])` or creating new features based on existing ones using Python 3 syntax.
    - Handle missing values in the features using appropriate techniques (e.g., `df.fillna()` for imputation) within each DataFrame copy.
6. **data_splitting**: For each DataFrame copy, split the data into training and testing sets using `train_test_split` from `sklearn.model_selection`.
7. **model_training**: Train a separate machine learning model (e.g., regression, classification) for each target variable using its corresponding DataFrame copy. Choose the model based on the nature of the target variable (continuous or categorical). Import necessary models from `sklearn` using Python 3 syntax.
    - **model_training**: Train a model to predict `MAJOR_CASH` using the first DataFrame copy.
    - **model_training**: Train a model to predict `MAJOR_PLEDGE` using the second DataFrame copy.
    - **model_training**: Train a model to predict `COMMIT_MAJOR` using the third DataFrame copy.
    - **model_training**: Train a model to predict `INFLATION_MAJOR_COMMIT` using the fourth DataFrame copy.
8. **model_evaluation**: Evaluate the performance of each model on its corresponding testing set using appropriate metrics (e.g., R-squared, accuracy, precision, recall). Import necessary metrics from `sklearn.metrics` using Python 3 syntax.
    - **model_evaluation**: Evaluate the `MAJOR_CASH` prediction model.
    - **model_evaluation**: Evaluate the `MAJOR_PLEDGE` prediction model.
    - **model_evaluation**: Evaluate the `COMMIT_MAJOR` prediction model.
    - **model_evaluation**: Evaluate the `INFLATION_MAJOR_COMMIT` prediction model.
9. **finish_task**: Write a summary report describing the process, the models trained for each target variable, their performance, and any insights gained from the analysis. Include recommendations for future work.
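
Steps 3 to 6 of the plan can be sketched as a single helper. This is only an illustration under stated assumptions: the function name `split_for_target`, the toy DataFrame, and the "yes"/"no" coding of the targets are made up for the example and are not part of the notebook's code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGETS = ["MAJOR_CASH", "MAJOR_PLEDGE", "COMMIT_MAJOR", "INFLATION_MAJOR_COMMIT"]

def split_for_target(frame, target, targets=TARGETS, test_size=0.2, seed=42):
    """Build a train/test split for one target, keeping all targets out of the features."""
    # Drop every target column so no target can leak into the feature matrix
    features = frame.drop(columns=targets)
    # One-hot encode any qualitative feature columns
    features = pd.get_dummies(features)
    # Assumes the target is coded as "yes"/"no" (hypothetical coding for this sketch)
    labels = (frame[target] == "yes").astype(int)
    return train_test_split(features, labels, test_size=test_size, random_state=seed)

# Tiny made-up frame, only to show the shapes involved
toy = pd.DataFrame({
    "MAJOR_CASH": ["yes", "no"] * 5,
    "MAJOR_PLEDGE": ["no", "yes"] * 5,
    "COMMIT_MAJOR": ["yes", "yes", "no", "no", "yes"] * 2,
    "INFLATION_MAJOR_COMMIT": ["no"] * 10,
    "AGE": list(range(30, 40)),
    "FRIEND": ["yes", "no", "no", "yes", "no"] * 2,
})

X_train, X_test, y_train, y_test = split_for_target(toy, "MAJOR_CASH")
print(list(X_train.columns))
```

Calling the helper once per entry in `TARGETS` gives four independent splits without keeping four full copies of the DataFrame around.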

# Install necessary packages

In [None]:
!pip install faker
!pip install scipy
!pip install scikit-learn #Install scikit-learn if not already installed

# Create Random test data

In addition, create correlations of:
- FRIEND being 65 percent correlated with major gift = yes
- ATTENDED_EVENT being 25 percent negatively correlated with major gift = yes
- Distance being 75 percent positively correlated with major gift = yes
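
As a rough check on what these settings produce, the sketch below rebuilds the same latent-shift construction in isolation and measures the Pearson correlation between the two resulting 0/1 variables. The helper name `simulate_association` and the seed are assumptions made for illustration. One point worth noting: the size of the shift is not the correlation itself. A shift of 0.65 is expected to give a correlation of roughly 0.25, and a shift of -0.25 a correlation of roughly -0.10.

```python
import numpy as np

def simulate_association(shift, n=10_000, seed=0):
    """Correlation between a binary target and a binary feature whose
    latent normal score is shifted by `shift` when the target is 'yes'."""
    rng = np.random.default_rng(seed)
    # Target: 'yes' with probability 0.5
    target = rng.standard_normal(n) > 0
    # Feature: latent N(0, 1) score, shifted for 'yes' rows, thresholded at 0
    # (thresholding the score at 0 is equivalent to norm.cdf(score) > 0.5)
    feature = (rng.standard_normal(n) + shift * target) > 0
    return np.corrcoef(target.astype(float), feature.astype(float))[0, 1]

corr_friend = simulate_association(0.65)
corr_event = simulate_association(-0.25)
print(f"FRIEND-style shift 0.65  -> correlation {corr_friend:.3f}")
print(f"EVENT-style shift -0.25 -> correlation {corr_event:.3f}")
```

If a specific correlation is the goal, the shift would need to be tuned until the measured value matches, rather than set equal to the target percentage.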

In [None]:
import pandas as pd
import random
from faker import Faker
import numpy as np
from scipy.stats import norm

# Initialize Faker and constants
fake = Faker()
num_rows = 10000
us_states = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida",
    "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
    "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska",
    "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas",
    "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
]

# Define base probabilities and relationships
major_cash_probs = norm.cdf(np.random.randn(num_rows))  # N(0, 1), transformed to probabilities
major_cash = np.where(major_cash_probs > 0.5, "yes", "no")

# FRIEND correlated with MAJOR_CASH (65%)
friend_probs = norm.cdf(np.random.randn(num_rows) + 0.65 * (major_cash == "yes"))
friend = np.where(friend_probs > 0.5, "yes", "no")

# ATTENDED_EVENT negatively correlated with MAJOR_CASH (25%)
attended_event_probs = norm.cdf(np.random.randn(num_rows) - 0.25 * (major_cash == "yes"))
attended_event = np.where(attended_event_probs > 0.5, "yes", "no")

# DISTANCE positively correlated with MAJOR_CASH (75%)
distance_mean = np.where(major_cash == "yes", 3000, 1000)  # Higher mean when MAJOR_CASH is "yes"
distance_std = 1000  # Standard deviation
distance = np.clip(norm.rvs(loc=distance_mean, scale=distance_std), 10, 4500)  # Clip range



# Generate random data for all fields
data = {
    "MAJOR_CASH": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "MAJOR_PLEDGE": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "COMMIT_MAJOR": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "INFLATION_MAJOR_COMMIT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "TOTAL_COMMIT_VALUE": [random.uniform(1000, 100000) for _ in range(num_rows)],
    "BASELINE": [random.uniform(500, 50000) for _ in range(num_rows)],
    "PRINCIPAL_GIFT": [random.uniform(100, 10000) for _ in range(num_rows)],
    "BELOW_BASELINE": [random.uniform(0, 5000) for _ in range(num_rows)],
    "LEADERSHIP_GIFT": [random.uniform(5000, 50000) for _ in range(num_rows)],
    "BASELINE_NO_LEADERSHIP": [random.uniform(500, 5000) for _ in range(num_rows)],
    "LEGAL_CREDIT": [random.uniform(100, 100000) for _ in range(num_rows)],
    "CASH": [random.uniform(0, 50000) for _ in range(num_rows)],
    "PLEDGE": [random.uniform(0, 30000) for _ in range(num_rows)],
    "DEFERRED": [random.uniform(0, 10000) for _ in range(num_rows)],
    "CASH_RECEIVED": [random.uniform(0, 50000) for _ in range(num_rows)],
    "OUTSTANDING_BALANCE": [random.uniform(0, 20000) for _ in range(num_rows)],
    "NON_GIFT": [random.uniform(0, 10000) for _ in range(num_rows)],
    "DATE_FIRST_GIFT": [fake.date_between(start_date="-30y", end_date="today") for _ in range(num_rows)],
    "FIRST_GIFT_AMOUNT": [random.uniform(50, 10000) for _ in range(num_rows)],
    "INFLATION_ADJUSTED_FIRST_AMOUNT": [random.uniform(50, 15000) for _ in range(num_rows)],
    "DATE_LAST_GIFT": [fake.date_between(start_date="-5y", end_date="today") for _ in range(num_rows)],
    "LAST_AMOUNT": [random.uniform(50, 10000) for _ in range(num_rows)],
    "INFLATION_ADJ_LAST_AMOUNT": [random.uniform(50, 15000) for _ in range(num_rows)],
    "LARGEST_DATE": [fake.date_between(start_date="-10y", end_date="today") for _ in range(num_rows)],
    "LARGEST_AMOUNT": [random.uniform(100, 20000) for _ in range(num_rows)],
    "INFLATION_ADJ_LARGEST_AMOUNT": [random.uniform(100, 25000) for _ in range(num_rows)],
    "PRIMARY_UNIT": [f"Unit_{random.randint(1, 15)}" for _ in range(num_rows)],
    "PRIMARY_UNIT_LIFETIME_FUNDRAISING": [random.uniform(10000, 500000) for _ in range(num_rows)],
    "TOTAL_YEARS_GIVING": [random.randint(1, 40) for _ in range(num_rows)],
    "RECENT_CONSECUTIVE_STREAK_GIVING": [random.randint(1, 10) for _ in range(num_rows)],
    "SEGMENT": [f"Segment_{random.randint(1, 10)}" for _ in range(num_rows)],
    "LONGEST_CONSECUTIVE_STREAK": [random.randint(1, 20) for _ in range(num_rows)],
    "TERMS_ATTENDED": [random.randint(3, 18) for _ in range(num_rows)],
    "HOURS_ATTENDED": [random.randint(0, 8000) for _ in range(num_rows)],
    "AGE": [random.randint(20, 70) for _ in range(num_rows)],
   # "AGE_CAT_BY_10": [f"{age // 10 * 10}s" for age in data["AGE"]],
    "ANONYMOUS_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "RECORD_STATUS": [fake.word() for _ in range(num_rows)],
    "ENROLLED_YEAR": [random.randint(1945, 2020) for _ in range(num_rows)],
    "ENROLLED_SCHOOL": [f"School_{random.randint(1, 15)}" for _ in range(num_rows)],
    "IS_FIRST_GEN_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "FIRST_DEGREE_YEAR": [random.randint(1950, 2020) for _ in range(num_rows)],
    "YEARS_SINCE_LAST_DEGREE": [random.uniform(0, 70) for _ in range(num_rows)],
    "YEARS_SINCE_FIRST_DEGREE": [random.uniform(0, 70) for _ in range(num_rows)],
    "FIRST_LAST_DEGREE_DIFF": [random.uniform(0, 10) for _ in range(num_rows)],
    "Age_at_FIRST_DEGREE": [random.randint(20, 70) for _ in range(num_rows)],
    "P_COUNTRY": [random.choice(["USA", "Canada", "UK", "Germany", "France", "Japan", "India", "Australia"]) for _ in range(num_rows)],
#     "P_STATE": [random.choice(us_states) if country == "USA" else None for country in data["P_COUNTRY"]],
     "ALUMNUS_DEGREED": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "ALUMNUS_NONDEGREED": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "EXTERNAL_CONTACT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "FORMER_EMPLOYEE_ALL": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "FORMER_SPECIFIC_EMPLOYEE": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "FRIEND": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "HOUSESTAFF": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "PARENT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "STUDENT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "EMPLOYEE": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "SPECIFC_EMPLOYEE": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "SPECIFIC_FRIEND": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "ANONYMOUS_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "RECORD_STATUS": [fake.word() for _ in range(num_rows)],
    "ENROLLED_YEAR": [random.randint(1945, 2020) for _ in range(num_rows)],
    "ENROLLED_SCHOOL": [f"School_{random.randint(1, 15)}" for _ in range(num_rows)],
    "IS_FIRST_GEN_YN": [random.choice([True, False]) for _ in range(num_rows)],
    "FIRST_DEGREE_YEAR": [random.randint(1950, 2020) for _ in range(num_rows)],
    "YEARS_SINCE_LAST_DEGREE": [random.uniform(0, 70) for _ in range(num_rows)],
    "YEARS_SINCE_FIRST_DEGREE": [random.uniform(0, 70) for _ in range(num_rows)],
    "FIRST_LAST_DEGREE_DIFF": [random.uniform(0, 10) for _ in range(num_rows)],
    "Age_at_FIRST_DEGREE": [random.randint(20, 70) for _ in range(num_rows)],
    "SPOUSE_IS_DECEASED_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "MARITAL_STATUS": [random.choice(["Single", "Married", "Divorced"]) for _ in range(num_rows)],
    "SOLICITABLE_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "PHONABLE_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "MAILABLE_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "EMAILABLE_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "GDPR_HOLD_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "TOP_MANAGER_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "LAST_SUBST_CONTACT_DATE": [fake.date_between(start_date="-5y", end_date="today") for _ in range(num_rows)],
    "LAST_SUBST_CONTACT_UNIT": [f"Unit_{random.randint(1, 15)}" for _ in range(num_rows)],
    "STAGE_OF_READINESS": [f"Stage_{random.randint(1, 9)}" for _ in range(num_rows)],
    "PROPOSAL_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "WEALTH_RATING": [random.randint(0, 20) for _ in range(num_rows)],
    "PARENT_OF_CURRENT_STUDENT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "PARENT_FIRST_TIME_IN_COLLEGE": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "PARENT_CURRENT_UNDERGRAD": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "ENROLLED_CHILDREN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "COUNT_ENROLLED_CHILDREN": [random.randint(0, 4) for _ in range(num_rows)],
    "PARENT_HONORS": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "DEGREE_COUNT": [random.randint(0, 5) for _ in range(num_rows)],
    "P_STATE": [fake.state_abbr() for _ in range(num_rows)],
    "P_COUNTRY": [random.choice(["USA", "Canada", "UK", "Germany", "France", "Japan", "India", "Australia"]) for _ in range(num_rows)],
    "TOTAL_COMMITS": [random.randint(0, 100) for _ in range(num_rows)],
    "LEGACY_SOCIETY": [random.choice([True, False]) for _ in range(num_rows)],
    "PRESIDENTS_COUNCIL": [random.choice([True, False]) for _ in range(num_rows)],
    "PRESIDENTIAL_PROSPECTS": [random.choice([True, False]) for _ in range(num_rows)],
    "PRINCIPAL_GIFTS": [random.choice([True, False]) for _ in range(num_rows)],
    "CELEBRITY": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "EXEC_BOARD": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "NATIONAL_BOARD": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "LIFE_BOARD": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "A_BOARD": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "PHILANTHROPIC_FLAG": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "ENGAGEMENT_FLAG": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "EXPERIENTIAL_FLAG": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "COMMUNICATION_FLAG": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "TOTAL_CREDENTIALS": [random.randint(0, 7) for _ in range(num_rows)],
    "LAST_ACTIVITY_DATE": [fake.date_between(start_date="-3y", end_date="today") for _ in range(num_rows)],
    "LAST_ACTIVITY_TYPE": [fake.word() for _ in range(num_rows)],
    "IS_LEGACY_STUDENT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "IS_ADV_BOARD_MEMBER": [random.choice([True, False]) for _ in range(num_rows)],
    "IS_ALUM_BOARD_MEMBER": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "ATTENDED_EVENT": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "TOTAL_EVENTS_ATTENDED": [random.randint(0, 23) for _ in range(num_rows)],
    "FRAT_OR_SOROR": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "HONOR_SOCIETY_IND": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "NUM_AWARDS_RCVD": [random.randint(0, 6) for _ in range(num_rows)],
    "NUM_ACTIVE_PLEDGES": [random.randint(0, 3) for _ in range(num_rows)],
    "GIVING_SOCIETY": [random.choice(["yes", "no"]) for _ in range(num_rows)],
 #   "DISTANCE": [random.uniform(10, 4500) if country == "USA" else None for country in data["P_COUNTRY"]],
   "A_MEMB_YN": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "A_MEMB_TYPE": [random.choice(["yes", "no"]) for _ in range(num_rows)],
    "NUM_OPEN_PROPOSALS": [random.randint(0, 3) for _ in range(num_rows)],
    "MG_MODEL_SCORE": [random.uniform(0, 5) for _ in range(num_rows)]
}

# Convert to DataFrame
df = pd.DataFrame(data)
df['DISTANCE'] = [random.uniform(10, 4500) if c == "USA" else None for c in df['P_COUNTRY']]
df['P_STATE'] = [random.choice(us_states) if country == "USA" else None for country in df['P_COUNTRY']]
df['AGE_CAT_BY_10'] = [f"{age // 10 * 10}s" for age in df['AGE']]
# Set the index to include new variables
# df = df.set_index(['DISTANCE', 'P_STATE', 'AGE_CAT_BY_10'])
df = df.reset_index(drop=True)
df =  pd.concat([df, pd.DataFrame({
     "MAJOR_CASH_C": major_cash,
    "FRIEND_C": friend,
    "ATTENDED_EVENT_C": attended_event,
    "DISTANCE_C": distance
})], axis=1)


# Reset the index; drop=True avoids adding a spurious 'index' column
# that would otherwise be picked up as a feature later on
df = df.reset_index(drop=True)
df.columns  # Inspect the final column labels

# df = df.set_index(['DISTANCE', 'P_STATE', 'AGE_CAT_BY_10'])
# Save to CSV
# output_path = "/content/Randomized_Dataset.csv"
# df.to_csv(output_path, index=True)

# print(f"Dataset with {len(data)} fields saved to {output_path}")
# df = pd.DataFrame(data)


In [None]:
#checking for all columns
df.shape[1]


# Create a list of column names for column manipulation


In [None]:
column_names = df.columns

# Convert the column names to a list
column_list = df.columns.tolist()

print(column_names)
print(column_list)

## General review of data


In [None]:
import pandas as pd
# 1. Data Shape
print(f"Data Shape: {df.shape}")

# 2. Data Types
print(f"\nData Types:\n{df.dtypes}")

# 3. Descriptive Statistics
print(f"\nDescriptive Statistics:\n{df.describe()}")

# 4. Unique Values
# for col in ['bird_name', 'device_info_serial']:
#    print(f"\nUnique Values for {col}:\n{df[col].value_counts()}")

# 5. Missing Values
print(f"\nMissing Values:\n{df.isnull().sum()}")

# 6. Correlations
# Select only numeric columns for correlation calculation
numeric_df = df.select_dtypes(include=['number'])

# Calculate correlations on the numeric DataFrame
print(f"\nCorrelations:\n{numeric_df.corr()}")


## 4): data_wrangling

### Subtask:
Convert qualitative variables into numerical representations using appropriate encoding techniques
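
The two encoding techniques used in this notebook behave differently, which a small made-up example can show. The toy DataFrame below is an assumption for illustration only. Label encoding assigns one integer per category, which implies an ordering; one-hot encoding creates one 0/1 column per category and implies none.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-dataset with two qualitative columns
toy = pd.DataFrame({
    "FRIEND": ["yes", "no", "yes"],
    "MARITAL_STATUS": ["Single", "Married", "Divorced"],
})

# Label encoding: each category becomes an integer (sorted alphabetically)
label_encoded = pd.DataFrame(
    {col: LabelEncoder().fit_transform(toy[col]) for col in toy.columns}
)

# One-hot encoding: each category of MARITAL_STATUS becomes its own 0/1 column
one_hot = pd.get_dummies(toy, columns=["MARITAL_STATUS"], dtype=int)

print(label_encoded)
print(one_hot)
```

For binary yes/no columns the two approaches carry the same information; the difference matters mostly for multi-category columns such as `MARITAL_STATUS` when a linear model is used.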



In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# df = pd.read_csv('/content/Randomized_Dataset.csv')


# Display data types and check for missing values.
print(df.info())

# Get summary statistics for numerical columns.
print(df.describe())

# Identify qualitative variables.
qualitative_vars = df.select_dtypes(include=['object']).columns.tolist()
print(f"Qualitative variables: {qualitative_vars}")

# Explore target variables.
target_vars = ['MAJOR_CASH', 'MAJOR_PLEDGE', 'COMMIT_MAJOR', 'INFLATION_MAJOR_COMMIT']
for target in target_vars:
    print(f"\n--- {target} ---")
    print(df[target].value_counts())
    # If numerical, plot a histogram.
    if df[target].dtype != 'object':
        df[target].hist()
        plt.title(f"Distribution of {target}")
        plt.show()

Fill in missing data and convert qualitative variables to numeric.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Record, for each row, which columns were missing BEFORE any imputation
missing_mask = df.isnull()
df['missing_indicator'] = missing_mask.apply(
    lambda row: ', '.join(row.index[row.to_numpy()]), axis=1
)

# Handle missing values (if any) - for simplicity, fill with the mean for numerical and the mode for categorical.
for col in missing_mask.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())

# Keep a readable copy of the indicator; the column inside df is label-encoded
# below with the other qualitative variables so that df stays fully numeric.
missing_indicator_text = df['missing_indicator'].copy()

# Convert qualitative variables to numerical using Label Encoding.
qualitative_vars = df.select_dtypes(include=['object', 'category']).columns.tolist()
for col in qualitative_vars:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

In [None]:
df.iloc[55:58, 100:121]

Transform the data in the four DataFrames (`df_major_cash`, `df_major_pledge`, `df_commit_major`, `df_inflation_major_commit`) to prepare them for model training. This involves dropping the other three target columns from each copy and converting qualitative variables into numerical representations. Working on copies preserves the main dataset.


In [None]:
df_major_cash = df.copy()  # Creating a copy to avoid modifying the original DataFrame
df_major_pledge = df.copy()
df_commit_major = df.copy()
df_inflation_major_commit = df.copy()

# Drop the other three target columns from each copy
df_major_cash = df_major_cash.drop(columns=['MAJOR_PLEDGE', 'COMMIT_MAJOR', 'INFLATION_MAJOR_COMMIT'])
df_major_pledge = df_major_pledge.drop(columns=['MAJOR_CASH', 'COMMIT_MAJOR', 'INFLATION_MAJOR_COMMIT'])
df_commit_major = df_commit_major.drop(columns=['MAJOR_CASH', 'MAJOR_PLEDGE', 'INFLATION_MAJOR_COMMIT'])
df_inflation_major_commit = df_inflation_major_commit.drop(columns=['MAJOR_CASH', 'MAJOR_PLEDGE', 'COMMIT_MAJOR'])
#
# Convert qualitative variables to numerical representations using one-hot encoding
def convert_qualitative_to_numerical(df):
    qualitative_vars = df.select_dtypes(include=['object']).columns
    df = pd.get_dummies(df, columns=qualitative_vars)
    return df

df_major_cash = convert_qualitative_to_numerical(df_major_cash)
df_major_pledge = convert_qualitative_to_numerical(df_major_pledge)
df_commit_major = convert_qualitative_to_numerical(df_commit_major)
df_inflation_major_commit = convert_qualitative_to_numerical(df_inflation_major_commit)

## MEAT AND POTATOES

## Loop through models
1) RANDOM FOREST

2) LOGISTIC REGRESSION

3) SVC

4) DecisionTreeClassifier

5) MLPClassifier


Programmatically test a variety of models, then explore and compare the results in your Google Colab environment.

Each model is trained 3 times per target variable, each time on a random sample of 50 variables. The 50 variables used in each run are stored with the results, so the runs can be compared across models and targets.

Example of exploring results (run this after the main loop below has populated `results`):

    # Calculate average accuracy for RandomForest on 'MAJOR_CASH'
    rf_major_cash_accuracy = [result['accuracy'] for result in results['MAJOR_CASH']['RandomForestClassifier']]
    avg_accuracy = np.mean(rf_major_cash_accuracy)
    print(f"Average accuracy for RandomForest on MAJOR_CASH: {avg_accuracy}")
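
The same comparison can be done for every target and model at once with a pandas `groupby`. The sketch below uses a hand-written list in the same shape as the rows the main loop appends to `all_results`; the accuracy values are made-up placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder rows mimicking the structure of `all_results` (values are invented)
toy_results = [
    {"target_variable": "MAJOR_CASH", "model": "RandomForestClassifier", "iteration": 1, "accuracy": 0.50},
    {"target_variable": "MAJOR_CASH", "model": "RandomForestClassifier", "iteration": 2, "accuracy": 0.52},
    {"target_variable": "MAJOR_CASH", "model": "RandomForestClassifier", "iteration": 3, "accuracy": 0.51},
    {"target_variable": "MAJOR_CASH", "model": "LogisticRegression", "iteration": 1, "accuracy": 0.49},
    {"target_variable": "MAJOR_CASH", "model": "LogisticRegression", "iteration": 2, "accuracy": 0.50},
    {"target_variable": "MAJOR_CASH", "model": "LogisticRegression", "iteration": 3, "accuracy": 0.51},
]

# Mean and spread of accuracy over the sampling iterations, per target and model
summary = (
    pd.DataFrame(toy_results)
    .groupby(["target_variable", "model"])["accuracy"]
    .agg(["mean", "std"])
    .reset_index()
)
print(summary)
```

With the real `results_df`, the same `groupby` call should work unchanged, and the `std` column gives a sense of how much the random choice of 50 variables moves the score.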

the whole enchilada 🌮


Identify the best variables (features) for your models. We'll use the feature importance scores provided by the models themselves (like RandomForest), or techniques like Recursive Feature Elimination (RFE) for models that don't directly provide feature importance.
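
A minimal sketch of both approaches on synthetic data follows. The dataset from `make_classification`, the number of features, and the choice of `LogisticRegression` as the RFE estimator are all assumptions made for the example, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, 3 of which carry signal
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Built-in importances: available on tree ensembles after fitting
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
forest_ranking = forest.feature_importances_.argsort()[::-1]  # best feature first

# RFE: repeatedly drops the weakest feature according to the estimator's coefficients
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
rfe_selected = [i for i, keep in enumerate(rfe.support_) if keep]

print("Forest ranking (feature indices):", forest_ranking.tolist())
print("RFE selected (feature indices):", rfe_selected)
```

RFE needs an estimator that exposes `coef_` or `feature_importances_`, which is why models such as an RBF-kernel SVC need a different technique (for example permutation importance).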

AlltogetherNOW

**Accuracy**: Measures the overall correctness of predictions. Calculated as (number of correct predictions) / (total number of predictions).

**Precision**: Measures the proportion of true positive predictions among all positive predictions. Useful when the cost of false positives is high.

**Recall (Sensitivity)**: Measures the proportion of true positive predictions among all actual positives. Useful when the cost of false negatives is high.

**F1-Score**: The harmonic mean of precision and recall. Provides a balance between the two metrics.
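
To make the definitions concrete, here is a small worked example that computes the four metrics by hand from the counts and compares them with scikit-learn. The label vectors are invented for the illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count the four outcome types
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

# Metrics from their definitions
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The scikit-learn functions should agree with the hand calculation
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(accuracy, precision, recall, f1)
```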

**Confusion Matrix**

A table that visualizes the performance of a classification model.
Shows the counts of true positives, true negatives, false positives, and false negatives.

    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    import matplotlib.pyplot as plt

    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()

**Classification Report**

Provides a summary of key classification metrics (precision, recall, F1-score, support) for each class.

    from sklearn.metrics import classification_report

    report = classification_report(y_true, y_pred)
    print(report)

**Incorporating into Your Code**

- **After model training**: Calculate and print these metrics after training each of your models.
- **Using `y_true` and `y_pred`**: Ensure you have the actual target values (`y_true`) and the model's predictions (`y_pred`) to use in the metric calculations.
- **Comparison**: Compare the metrics across different models to understand their relative strengths and weaknesses.

By systematically monitoring these performance metrics, you can gain valuable insights into your models' behavior and make informed decisions about model selection and improvement.

More all together

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier # Import MLP Classifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,  confusion_matrix, classification_report
# from sklearn.feature_selection import RFE  # For Recursive Feature Elimination

# Define a list of models to test
models = [
#    LogisticRegression(random_state=42),
    LogisticRegression(random_state=42, solver='saga', max_iter=1000),
    RandomForestClassifier(random_state=42),
    SVC(random_state=42),
    DecisionTreeClassifier(random_state=42),
    MLPClassifier(random_state=42, hidden_layer_sizes=(100,), max_iter=500)  # Add MLP
]

# Define a dictionary to store results
results = {}

# Create a list to store all results for DataFrame
all_results = []

# Main loop through target variables, models, and sampling iterations
for target_variable in ['MAJOR_CASH', 'MAJOR_PLEDGE', 'COMMIT_MAJOR', 'INFLATION_MAJOR_COMMIT']:
    results[target_variable] = {}
    for model in models:
        model_name = type(model).__name__
        results[target_variable][model_name] = []

        for iteration in range(3):
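            # Get all features, excluding every target variable so that each
            # target is predicted independently of the other three
            target_variables = ['MAJOR_CASH', 'MAJOR_PLEDGE', 'COMMIT_MAJOR', 'INFLATION_MAJOR_COMMIT']
            all_features = [col for col in df.columns if col not in target_variables]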

            # Randomly sample 50 features
            sampled_features = np.random.choice(all_features, size=50, replace=False)

            # Get the data for the current target variable and sampled features
            X = df[sampled_features]
            y = df[target_variable]

            # Split the data
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # Train the model
            model.fit(X_train, y_train)

            # Make predictions
            y_pred = model.predict(X_test)

            # Calculate evaluation metrics
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, zero_division=1)
            recall = recall_score(y_test, y_pred, zero_division=1)
            f1 = f1_score(y_test, y_pred, zero_division=1)
            # Confusion matrix, classification report, and permutation importance
            cm = confusion_matrix(y_test, y_pred)  # Get confusion matrix
            report = classification_report(y_test, y_pred, output_dict=True, zero_division=1)  # Get classification report as dict
            perm_importance = permutation_importance(model, X_test, y_test, scoring='accuracy', random_state=42)


            # Store results in the dictionary and list for DataFrame
            results[target_variable][model_name].append({
                'iteration': iteration + 1,
                'features': sampled_features.tolist(),
                'accuracy': accuracy,
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'perm_importance_mean': perm_importance.importances_mean.tolist(),  # Add importance
                'cm': cm,
                'report': report
            })
            all_results.append({
                'target_variable': target_variable,
                'model': model_name,
                'iteration': iteration + 1,
                'features': sampled_features.tolist(),
                'accuracy': accuracy,
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'perm_importance_mean': perm_importance.importances_mean.tolist(),  # Store importance
                'cm': cm,
                'report': report
            })

# Create a DataFrame from the results list
results_df = pd.DataFrame(all_results)

# Display the DataFrame
print(results_df)

# Feature importance for every model, using the last sampled feature set (X, y)
# left over from the main loop above.
for model in models:
    model_name = type(model).__name__
    model.fit(X, y)  # Fit on all rows for feature importance

    if hasattr(model, 'feature_importances_'):
        # Models with built-in importance scores (RandomForest, DecisionTree)
        print(f"\n-- {model_name} --")
        importances = model.feature_importances_
    else:
        # Models without built-in scores: fall back to permutation importance
        print(f"\n-- {model_name} (using Permutation Importance) --")
        importances = permutation_importance(
            model, X, y, scoring='accuracy', random_state=42
        ).importances_mean

    # Sort feature indices by importance and print only features with importance > 0
    indices = np.argsort(importances)[::-1]
    for rank, idx in enumerate(indices, start=1):
        if importances[idx] > 0:
            print("%d. feature %s (%f)" % (rank, X.columns[idx], importances[idx]))



**Possible Solutions**

Here are a few options to address the `ConvergenceWarning` raised by `LogisticRegression`:

**Increase `max_iter`**

The most straightforward approach is to increase the `max_iter` parameter of the `LogisticRegression` model. This allows the solver more iterations to try to converge. Try a larger value, such as 10000 or even higher, and see if the warning disappears. Add the following code inside the loop, after `model_name` is defined:

    if model_name == 'LogisticRegression':
        model.set_params(max_iter=10000)  # Increase max_iter for Logistic Regression

**Adjust `tol` (tolerance)**

The `tol` parameter controls the tolerance of the stopping criterion. Increasing it above the default of `1e-4` (for example to `tol=1e-3`) lets the solver stop sooner, at the cost of a less precise solution. Decreasing it makes the criterion stricter and usually requires more iterations, so it will not remove the warning. Experiment with the value and observe the effect on both the warning and the performance. Add the following code inside the loop, after `model_name` is defined:

    if model_name == 'LogisticRegression':
        model.set_params(tol=1e-3)  # Loosen the stopping tolerance

**Use a different solver**

While 'saga' is generally a good choice for large datasets, you could try other solvers such as 'lbfgs' or 'liblinear' to see if they converge more easily on your data. Change the code to:

    models = [
        LogisticRegression(random_state=42, solver='lbfgs', max_iter=1000),  # changing solver from 'saga' to 'lbfgs'
        #...(rest of your models)
    ]

**Should you skip the model?**

If the warning persists even after trying the above solutions and your accuracy is poor, it might be worth skipping the `LogisticRegression` model with the 'saga' solver, especially if other models perform significantly better. However, if the accuracy is acceptable or comparable to other models despite the warning, it might be okay to keep it in the analysis. You can monitor the warning and make a judgment based on overall model performance.

**Incorporating the changes**

Apply the chosen solution(s) where the `LogisticRegression` model is defined or configured within the loop. Track the warning messages and performance metrics to assess the effect of each modification.

I recommend starting with increasing `max_iter` and then experimenting with `tol` if needed. If those don't resolve the issue, consider trying a different solver or skipping the model, depending on your accuracy goals.
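
One further option, not listed above and offered here only as a suggestion: the 'saga' solver tends to converge slowly when features sit on very different scales, as they do in this dataset (dollar amounts next to 0/1 flags). Standardizing the features inside a pipeline often helps. The sketch below uses synthetic data from `make_classification` with deliberately mismatched column scales; the data and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
# Put the columns on very different scales (from 1x up to 10,000x)
X = X * np.logspace(0, 4, X.shape[1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on the training split only, then applied before the solver runs
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', max_iter=1000, random_state=42),
)
scaled_logreg.fit(X_train, y_train)
test_accuracy = scaled_logreg.score(X_test, y_test)
print(f"Test accuracy with scaling: {test_accuracy:.3f}")
```

Because the pipeline is itself an estimator, it could replace the plain `LogisticRegression` entry in the `models` list; note that `type(model).__name__` would then report `Pipeline` rather than `LogisticRegression`.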



In [None]:
results_df.to_csv('results.csv', index=False)