# Preprocess data using `index.ipynb` steps

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier  # or your chosen model
import pickle

# Load dataset
df_preprocess = pd.read_csv('../anemia_dataset.csv')

# Dropping rows without target variable
df_preprocess = df_preprocess.dropna(subset=['Anemia level'])

# Separate numerical columns (int and float types)
numeric_cols = df_preprocess.select_dtypes(include=['int64', 'float64']).copy()
numeric_cols.drop(['Hemoglobin level adjusted for altitude (g/dl - 1 decimal)'], axis=1, inplace=True)

# Rename numerical columns for consistency
numeric_cols.rename(columns={
    'Births in last five years':'Births_last_5y',
    'Age of respondent at 1st birth':'Age_first_birth',
    'Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)':'Hemoglobin_level'
}, inplace=True)

# Separate categorical columns
categorical_cols = df_preprocess.select_dtypes(include=['object', 'category']).copy()
categorical_cols.drop(['When child put to breast', 'Anemia level.1', 'Smokes cigarettes'], axis=1, inplace=True)

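# Replace "Don't know" with "unknown" and fill missing values
# (assign the result back: replace(..., inplace=True) on a selected column is unreliable in recent pandas)
categorical_cols['Had fever in last two weeks'] = categorical_cols['Had fever in last two weeks'].replace("Don't know", 'unknown')
categorical_cols['Taking iron pills, sprinkles or syrup'] = categorical_cols['Taking iron pills, sprinkles or syrup'].replace("Don't know", 'unknown')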
columns_to_fill = ['Taking iron pills, sprinkles or syrup', 'Had fever in last two weeks', 'Currently residing with husband/partner']
categorical_cols[columns_to_fill] = categorical_cols[columns_to_fill].fillna('unknown')

# Rename categorical columns
categorical_cols.rename(columns={
    'Have mosquito bed net for sleeping (from household questionnaire)': 'Mosquito_net',
    'Highest educational level': 'Education_level',
    'Wealth index combined': 'Wealth',
    'Currently residing with husband/partner': 'Living_with_spouse',
    'Type of place of residence': 'Area_Type',
    'Taking iron pills, sprinkles or syrup': 'Taking_meds',
    'Age in 5-year groups': 'Age_group',
    'Anemia level': 'Anemia_level',
    'Had fever in last two weeks': 'Had_fever',
    'Current marital status': 'Marital_status'
}, inplace=True)

# Split the target column
y = categorical_cols['Anemia_level']
categorical_cols = categorical_cols.drop(['Anemia_level'], axis=1)

# Concatenate cleaned numerical and categorical columns
combined_df = pd.concat([numeric_cols, categorical_cols], axis=1)

# Drop rows with missing values and keep the target aligned with the remaining rows
combined_df = combined_df.dropna()
y = y.loc[combined_df.index]

# Encode categorical features using OrdinalEncoder
encoder = OrdinalEncoder()
encoded_categorical = encoder.fit_transform(combined_df[categorical_cols.columns])

# Replace the categorical columns with their encoded versions (same index as the kept rows)
encoded_df = pd.DataFrame(encoded_categorical, columns=categorical_cols.columns, index=combined_df.index)
combined_df = pd.concat([combined_df[numeric_cols.columns], encoded_df], axis=1)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(combined_df, y, test_size=0.2, random_state=42)

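# Define the preprocessing step: scale the numeric columns and pass the
# encoded categorical columns through unchanged (the default, remainder='drop',
# would silently discard them before the model sees them)
numeric_transformer = StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, list(numeric_cols.columns))
    ],
    remainder='passthrough'
)

# Wrap the preprocessing step in a pipeline (the model is trained separately below)
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit the pipeline on training data
pipeline.fit(X_train)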

# Train the model (RandomForest example)
model = RandomForestClassifier(
    max_depth=30,
    max_features='sqrt',  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=200,
    random_state=42
)

model.fit(pipeline.transform(X_train), y_train)

# Save the encoder, pipeline, and model together as one tuple
with open('anemia_model.pkl', 'wb') as f:
    pickle.dump((encoder, pipeline, model), f)


# MODEL DEPLOYMENT

## Overview
This project builds a web application for predicting anemia levels in children. Users enter the relevant data through a simple form and receive a prediction along with tailored recommendations.

## Procedure

### 1. **Pickle the encoder, pipeline and model**
- The Random Forest model was trained and pickled together with the encoder and the preprocessing pipeline, so the web application can load all three from one file and apply the same preprocessing at prediction time.
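
The round trip can be sketched as follows, assuming the file holds a single `(encoder, pipeline, model)` tuple as in the notebook cell above. The two toy columns, the temporary file location, and the `encode` and `predict_one` helpers are illustrative stand-ins, not the project's actual data or code.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

NUMERIC = ["Hemoglobin_level"]  # toy subset of the numeric columns
CATEGORICAL = ["Area_Type"]     # toy subset of the categorical columns


def encode(encoder, df):
    """Return the numeric columns plus the ordinal-encoded categorical columns."""
    codes = pd.DataFrame(encoder.transform(df[CATEGORICAL]),
                         columns=CATEGORICAL, index=df.index)
    return pd.concat([df[NUMERIC], codes], axis=1)


# Toy training data standing in for the cleaned anemia dataset.
train = pd.DataFrame({
    "Hemoglobin_level": [8.0, 9.5, 12.5, 13.0],
    "Area_Type": ["Rural", "Urban", "Rural", "Urban"],
})
labels = ["Moderate", "Moderate", "Not anemic", "Not anemic"]

# Fit the three artifacts in the same order as the notebook, on a small scale.
encoder = OrdinalEncoder().fit(train[CATEGORICAL])
features = encode(encoder, train)
pipeline = Pipeline([
    ("preprocessor", ColumnTransformer(
        [("num", StandardScaler(), NUMERIC)], remainder="passthrough")),
]).fit(features)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(pipeline.transform(features), labels)

# Pickle all three objects together so they cannot drift apart.
model_path = Path(tempfile.mkdtemp()) / "anemia_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump((encoder, pipeline, model), f)


def predict_one(path, record):
    """Load the pickled tuple and predict the label for one input record."""
    with open(path, "rb") as f:
        loaded_encoder, loaded_pipeline, loaded_model = pickle.load(f)
    row = encode(loaded_encoder, pd.DataFrame([record]))
    return loaded_model.predict(loaded_pipeline.transform(row))[0]


prediction = predict_one(model_path, {"Hemoglobin_level": 8.5, "Area_Type": "Rural"})
print(prediction)
```

Two practical points follow from this design. By default, `OrdinalEncoder` raises an error for a category it did not see during fitting, so the form should only offer values that occur in the training data. Pickled scikit-learn objects are also tied to the library version that created them, so the deployment environment should pin the same version used for training.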

### 2. **Install Dependencies**
- Flask was installed as the web framework for the application; it handles routing, form data, and serving the HTML templates.
- **PostgreSQL** and **SQLAlchemy** were added to store prediction results in a relational database.
    - **`psycopg2`**: PostgreSQL adapter for Python.
    - **`Flask-SQLAlchemy`**: Extension that integrates SQLAlchemy with Flask for working with the PostgreSQL database.

### 3. **Organize Project Folder**
- The project directory was split into separate locations for the Flask app, the pickled model, HTML templates, CSS styles, and image assets.


#### Structure Overview
```
Deployment/
├── app.py                     # Flask app
├── anemia_model.pkl           # Pickled pipeline, encoder, and model (created by merge.ipynb)
├── merge.ipynb                # Jupyter notebook for saving the pickled model
├── templates/
│   ├── index.html             # HTML template for home page
│   ├── general_info.html      # HTML template for general info page
│   ├── factors.html           # HTML template for EDA insights
│   ├── prediction.html        # HTML template for prediction form
│   └── result.html            # HTML template for result page
├── static/
│   ├── css/
│   │   └── styles.css         # CSS styles
│   └── images/
│       ├── background.jpg     # Background image
│       ├── anemic_child.jpg   # Anemic child image
│       ├── Anemia level, Education level and Area_type.png
│       ├── anemia-level_vs_age-group.png
│       ├── anemia-level_vs_area-type.png
│       ├── anemia-level_vs_education-level.png
│       ├── Distribution-Age at first birth.png
│       └── Distribution- Hemoglobin level.png

```

### 4. **Set Up the Flask App**
The Flask app (`app.py`) contains routes to manage the different pages and features:
- **Home (`/`)**: Displays the `index.html` template, which includes a welcome message and a brief description of the project.
- **General Info (`/general_info`)**: Renders the `general_info.html` page with educational content about anemia, its causes, and prevention.
- **Factors (`/factors`)**: Displays the `factors.html` template, which shows key insights and visualizations from the exploratory data analysis (EDA) phase.
- **Prediction (`/predict`)**: Presents the input form from `prediction.html`, runs the submitted values through the pre-trained model, and renders the outcome in `result.html`.
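
A minimal sketch of how these four routes might be wired up. It is not the project's `app.py`: the inline HTML strings stand in for the files in `templates/` (the real app would call `render_template("index.html")` and so on), `predict_level` is a hypothetical placeholder for the encoder, pipeline, and model call, and serving the form on `GET /predict` is an assumption.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline stand-ins for the files in templates/, so the sketch runs on its own.
PAGES = {
    "index": "<h1>Anemia prediction</h1>",
    "general_info": "<h1>About anemia</h1>",
    "factors": "<h1>EDA insights</h1>",
    "prediction": (
        "<form method='post'>"
        "<input name='Hemoglobin_level'><button>Predict</button>"
        "</form>"
    ),
    "result": "<p>Predicted anemia level: {{ level }}</p>",
}


def predict_level(form):
    """Hypothetical placeholder for encoder -> pipeline -> model.predict."""
    return "placeholder"


@app.route("/")
def index():
    return render_template_string(PAGES["index"])


@app.route("/general_info")
def general_info():
    return render_template_string(PAGES["general_info"])


@app.route("/factors")
def factors():
    return render_template_string(PAGES["factors"])


@app.route("/predict", methods=["GET", "POST"])
def predict():
    # POST carries the submitted form; GET just shows the empty form.
    if request.method == "POST":
        level = predict_level(request.form)
        return render_template_string(PAGES["result"], level=level)
    return render_template_string(PAGES["prediction"])
```

Flask's built-in test client can exercise the routes without starting a server, for example `app.test_client().post("/predict", data={"Hemoglobin_level": "10.5"})`.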

### 5. **Create HTML Templates**
The `templates/` folder includes several key HTML files:
- **`index.html`**: The homepage, which greets the user with a welcome message and a brief description of the project.
- **`general_info.html`**: Provides educational information about anemia, its symptoms, and prevention strategies.
- **`factors.html`**: Displays various visual insights from the EDA phase, such as the relationship between anemia levels and factors like age, education, and area type.
- **`prediction.html`**: Contains the form where users enter their details for prediction.
- **`result.html`**: Displays the prediction result and provides personalized recommendations based on the input.
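
Values submitted from `prediction.html` arrive as strings, so they need to be validated and converted before they reach the encoder and model. The sketch below shows one way to do that, assuming the form fields are named after the renamed columns from the preprocessing step. The `parse_form` helper and the exact field list are hypothetical; the real dataset may contain further columns.

```python
# Assumed field names, taken from the column renames in the preprocessing cell.
NUMERIC_FIELDS = {
    "Births_last_5y": int,
    "Age_first_birth": int,
    "Hemoglobin_level": float,
}
CATEGORICAL_FIELDS = [
    "Age_group", "Area_Type", "Education_level", "Wealth", "Mosquito_net",
    "Marital_status", "Living_with_spouse", "Had_fever", "Taking_meds",
]


def parse_form(form):
    """Convert raw form strings into a typed record.

    Returns (record, errors); errors maps a field name to a message and is
    empty when every field is valid.
    """
    record, errors = {}, {}
    for name, cast in NUMERIC_FIELDS.items():
        raw = (form.get(name) or "").strip()
        try:
            record[name] = cast(raw)
        except ValueError:
            errors[name] = f"expected a number, got {raw!r}"
    for name in CATEGORICAL_FIELDS:
        value = (form.get(name) or "").strip()
        if value:
            record[name] = value
        else:
            errors[name] = "this field is required"
    return record, errors


# Example submission; the category labels here are made up for illustration.
example_form = {name: "example" for name in CATEGORICAL_FIELDS}
example_form.update(
    {"Births_last_5y": "1", "Age_first_birth": "22", "Hemoglobin_level": "10.5"}
)
record, errors = parse_form(example_form)
print(record["Hemoglobin_level"], errors)
```

Returning the errors alongside the record lets the route re-render the form with messages instead of failing inside the model call.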

### 6. **UI Design**
The user interface is styled using CSS in `static/css/styles.css`:
- **Background and Header Images**: The visual design is enhanced with `background.jpg` as the page background and `anemic_child.jpg` as the header image.
- **Layout and Styling**: Flexbox is used to ensure a responsive layout. Buttons and links have interactive hover effects for better user engagement.

### 7. **Include EDA Insights with Images**
The `factors.html` template displays several EDA visualizations:
- **Anemia Level vs Age Group**: The distribution of anemia levels across different age groups of children, shown in `anemia-level_vs_age-group.png`.
- **Anemia Level vs Area Type**: A comparison of anemia levels between urban and rural areas, shown in `anemia-level_vs_area-type.png`.
- **Anemia Level vs Education Level**: The relationship between parents' education level and children's anemia severity, shown in `anemia-level_vs_education-level.png`.
- **Distribution of Hemoglobin Levels**: The spread of hemoglobin levels among the children, shown in `Distribution- Hemoglobin level.png`.
- **Mother's Age at First Birth**: A distribution plot showing the age at which mothers had their first child, shown in `Distribution-Age at first birth.png`.

These images are stored in `static/images/` and summarize the factors associated with anemia prevalence in children.
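
Several image filenames in the structure overview contain spaces, and one contains a comma, so their URLs must be percent-encoded. Inside a template, Flask's `url_for('static', filename=...)` takes care of this. The standard-library sketch below only illustrates the encoding involved, and the exact set of characters Flask leaves unescaped may differ. The `static_image_url` helper is hypothetical.

```python
from urllib.parse import quote

# A few filenames from the static/images/ listing in the structure overview.
EDA_IMAGES = [
    "anemia-level_vs_age-group.png",
    "Distribution- Hemoglobin level.png",
    "Anemia level, Education level and Area_type.png",
]


def static_image_url(filename):
    """Build a percent-encoded URL path for a file in static/images/."""
    return quote(f"/static/images/{filename}")


for name in EDA_IMAGES:
    print(static_image_url(name))
```

Renaming the files to avoid spaces and commas would sidestep the issue entirely, at the cost of updating every reference to them.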


### 8. **Integrate PostgreSQL for Saving Predictions**
To persist predictions, the following steps were performed:
- **SQLAlchemy Setup**:
    - `Flask-SQLAlchemy` was configured to connect the Flask app to the PostgreSQL database hosted on Render.
    - The database connection URI was stored in the app's configuration under the `SQLALCHEMY_DATABASE_URI` key.
    - SQLAlchemy's ORM was used to define the structure of the `predictions` table through a `Predictions` model.
- **Prediction Storage**: Each prediction result is saved to the `predictions` table together with the user inputs that produced it.

