# Task 10


In [1]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer

# Define categorical and text features
categorical_features = ['Country', 'Disease Category', 'Gender']
text_features = ['Disease Name', 'Description']

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    # One-hot encoding
])

# Preprocessing for text features
text_transformer = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000))  # TF-IDF with 1000 features
])

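# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        # TfidfVectorizer expects a 1-D sequence of strings, so each text column
        # is selected by name (a string, not a list) and gets its own vectorizer
        *[(f'text_{i}', text_transformer, col) for i, col in enumerate(text_features)]
    ]
)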

# Example usage in a pipeline
from sklearn.ensemble import RandomForestRegressor

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor())
])

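# Fit the pipeline
# X_train and y_train must already be defined (e.g., loaded and split earlier);
# running this cell without them raises the NameError shown below
pipeline.fit(X_train, y_train)

NameError: name 'X_train' is not defined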
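
The error above occurs because no training data exists in the session. As a minimal sketch of a runnable version, the block below builds a tiny DataFrame and target and fits a reduced pipeline on them. The rows, the target values, and the reduced column set are all hypothetical and were invented only so the pipeline has something to fit.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not from any real dataset)
X_train = pd.DataFrame({
    'Country': ['India', 'Brazil', np.nan, 'India'],
    'Gender': ['Male', 'Female', 'Female', np.nan],
    'Description': ['fever and chills', 'skin rash',
                    'persistent cough', 'fever and cough'],
})
y_train = [0.4, 0.1, 0.3, 0.5]  # hypothetical numeric target

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, ['Country', 'Gender']),
    # A single column name (string) yields the 1-D input TfidfVectorizer needs
    ('text', TfidfVectorizer(), 'Description'),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_train)
print(predictions.shape)
```

With real data, `X_train` and `y_train` would normally come from something like `train_test_split` applied to the loaded dataset.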

## Handling Categorical Attributes

### 1. Processing Categorical Attributes
Categorical attributes are non-numeric features that represent discrete values (e.g., Country, Disease Category, Gender). These attributes need to be converted into a numerical format for machine learning models to process them.

### 2. Encoding Techniques

#### One-Hot Encoding
This technique converts each category into a binary vector (0 or 1). For example, if Gender has three categories (Male, Female, Other), one-hot encoding will create three binary columns: `Gender_Male`, `Gender_Female`, and `Gender_Other`.

**Justification:** One-hot encoding is suitable for nominal categorical features (no inherent order) because it avoids introducing artificial ordinal relationships.

#### Label Encoding
This technique assigns a unique integer to each category. For example:
```
Male = 0
Female = 1
Other = 2
```
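**Justification:** Integer encoding is suitable for ordinal categorical features (e.g., Age Group: `0-18`, `19-35`, `36-60`, `61+`), where the order matters and the integers follow that order. For a nominal feature such as Gender, as in the mapping above, the integers imply an order that does not exist, so one-hot encoding is the safer choice. Note that scikit-learn's `LabelEncoder` is intended for target labels and assigns integers in sorted order; for ordinal input features, `OrdinalEncoder` with an explicit category order preserves the intended ranking.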
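
For an ordinal feature such as the Age Group example, the integer codes should follow the category order. Here is a minimal sketch, assuming scikit-learn's `OrdinalEncoder` with the order passed explicitly; the sample rows are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The intended order is stated explicitly rather than inferred from the data
age_order = ['0-18', '19-35', '36-60', '61+']
encoder = OrdinalEncoder(categories=[age_order])

# Hypothetical sample rows (one feature column)
X_age = np.array([['19-35'], ['61+'], ['0-18'], ['36-60']], dtype=object)
encoded_age = encoder.fit_transform(X_age)

# Each value is the position of the category in age_order
print(encoded_age.ravel())
```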

### 3. Handling Missing Categorical Values

#### Imputation
Missing categorical values are filled with the most frequent value (mode) using `SimpleImputer`.

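**Justification:** Mode imputation keeps every row and introduces no new category, but it inflates the share of the most frequent value. When many values are missing, filling with an explicit placeholder category (for example `SimpleImputer(strategy='constant', fill_value='Missing')`) may distort the distribution less.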
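
A short sketch of mode imputation on a single categorical column, with a constant placeholder shown alongside for comparison. The column values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Gender column with one missing entry
X_gender = np.array([['Male'], ['Female'], [np.nan], ['Male']], dtype=object)

# Fill with the most frequent value (here 'Male')
mode_imputer = SimpleImputer(strategy='most_frequent')
filled_mode = mode_imputer.fit_transform(X_gender)

# Alternative: mark missing entries with an explicit placeholder category
constant_imputer = SimpleImputer(strategy='constant', fill_value='Missing')
filled_constant = constant_imputer.fit_transform(X_gender)

print(filled_mode.ravel())
print(filled_constant.ravel())
```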

---

## Handling Text Attributes

### 1. Preprocessing Text Data
Text data (e.g., Disease Name, Description) requires preprocessing to convert it into a numerical format. The steps include:

#### Cleaning
- Remove special characters, punctuation, and stopwords.
- Convert text to lowercase.

#### Tokenization
- Split text into individual words or tokens.

#### Stemming/Lemmatization
- Reduce words to their root form (e.g., `running` → `run`).
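
The cleaning and tokenization steps above can be sketched with the standard library alone. The stop-word list here is a small illustrative assumption, and `preprocess` is a hypothetical helper name. Stemming and lemmatization are left out because they usually rely on an external library such as NLTK or spaCy.

```python
import re

# Small illustrative stop-word list (an assumption; real pipelines use a fuller one)
STOP_WORDS = {'a', 'an', 'and', 'by', 'is', 'of', 'the', 'this'}

def preprocess(text):
    """Lowercase, strip non-letter characters, tokenize, and drop stop words."""
    text = text.lower()
    # Replace punctuation, digits, and special characters with spaces
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Whitespace tokenization
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Malaria is a disease, caused by the Plasmodium parasite!")
print(tokens)
```

Note that `TfidfVectorizer` and `CountVectorizer` already lowercase and tokenize by default, so manual cleaning like this is mainly needed when extra steps such as stemming are applied before vectorization.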

### 2. Converting Text into Numerical Format

#### Bag-of-Words (BoW)
Represents text as a vector of word frequencies.

**Justification:** Simple and effective for small datasets.

#### TF-IDF (Term Frequency-Inverse Document Frequency)
Weights each word by how often it appears in a document, scaled down by how many documents in the corpus contain it.

**Justification:** Reduces the impact of common words and emphasizes unique words.

#### Word Embeddings
Represents words as dense vectors in a continuous vector space (e.g., Word2Vec, GloVe).

**Justification:** Captures semantic relationships between words.

### 3. Challenges and Solutions

#### Challenge 1: High Dimensionality
- Text data can result in a large number of features (e.g., thousands of unique words).
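- **Solution:** Use dimensionality reduction (e.g., truncated SVD, which operates directly on sparse matrices, or PCA on dense data) or limit the vocabulary size (e.g., `max_features` in `TfidfVectorizer`).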
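
Both options can be sketched together: `max_features` caps the vocabulary, and `TruncatedSVD` then projects the sparse TF-IDF matrix onto a few components. The corpus and the chosen sizes are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus with 8 distinct words
docs = [
    "fever cough headache",
    "fever rash itching",
    "cough sore throat",
    "rash swelling itching",
]

# Option 1: cap the vocabulary at the 5 most frequent terms
vectorizer = TfidfVectorizer(max_features=5)
X_sparse = vectorizer.fit_transform(docs)

# Option 2: reduce the sparse matrix to 2 dense components
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_sparse.shape, X_reduced.shape)
```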

#### Challenge 2: Out-of-Vocabulary Words
- Some words in the test set may not be present in the training set.
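- **Solution:** Vectorizers such as `TfidfVectorizer` ignore words that were not seen during fitting, so unseen words do not raise errors but contribute no features. Character n-grams (e.g., `analyzer='char_wb'`) can recover some signal from such words. The analogous issue for categorical features, unseen categories, is handled with `handle_unknown='ignore'` in `OneHotEncoder`.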
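
A small sketch of how unseen values behave at transform time, for both text and categorical features. The training and test values are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Text: a word never seen during fit is dropped, giving an all-zero row
text_vec = TfidfVectorizer()
text_vec.fit(["fever cough", "fever rash"])
unseen_text = text_vec.transform(["migraine"])
print(unseen_text.nnz)  # number of non-zero entries

# Categories: handle_unknown='ignore' encodes an unseen category as all zeros
ohe_demo = OneHotEncoder(handle_unknown='ignore')
ohe_demo.fit(np.array([['Male'], ['Female']], dtype=object))
unseen_cat = ohe_demo.transform(np.array([['Other']], dtype=object)).toarray()
print(unseen_cat)
```

Without `handle_unknown='ignore'`, `OneHotEncoder` raises an error when it meets an unseen category at transform time.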

---

### Code Example
```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

# Example for One-Hot Encoding
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X = np.array([['Male'], ['Female'], ['Other'], ['Male']])
encoded = ohe.fit_transform(X)
print("One-Hot Encoded Data:\n", encoded)

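# Example for Label Encoding
# Note: LabelEncoder assigns integers in sorted (alphabetical) order,
# i.e. High=0, Low=1, Medium=2, so the Low < Medium < High ranking is NOT preserved.
# Use OrdinalEncoder with explicit categories when the order matters.
le = LabelEncoder()
labels = np.array(['Low', 'Medium', 'High', 'Medium'])
encoded_labels = le.fit_transform(labels)
print("Label Encoded Data:", encoded_labels)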

# Example for TF-IDF Vectorization
tfidf = TfidfVectorizer()
corpus = ["This is an example sentence", "Another example sentence"]
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF Feature Names:", tfidf.get_feature_names_out())