**Libraries Imported:**

**pandas (pd):** Library for data manipulation and analysis.<br>
**numpy (np):** Library providing support for multi-dimensional arrays and matrices.<br>
**matplotlib.pyplot (plt):** Library for creating visualizations in Python.<br>
**Modules and Classes Imported from sklearn:**<br>
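**make_pipeline:** Builds a Pipeline from a sequence of transformers followed by a final estimator, naming the steps automatically.<br>
**make_column_transformer:** Builds a ColumnTransformer that applies different transformers to different subsets of columns.<br>
**make_column_selector:** Creates a callable that selects columns by dtype or by name pattern, for use with a column transformer.<br>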
**StandardScaler:** Preprocessing class for standardizing features by scaling them to have a mean of 0 and variance of 1.<br>
**LabelEncoder:** Preprocessing class for encoding target labels as integers from 0 to n_classes - 1; it is intended for the target, not for input features.<br>
**OneHotEncoder:** Preprocessing class for one-hot encoding categorical features.<br>
**SimpleImputer:** Preprocessing class for imputing missing values in datasets.<br>
**RandomForestClassifier:** Ensemble learning method based on decision trees for classification tasks.<br>
**VotingClassifier:** Combines several classifiers and predicts by majority vote (hard voting) or by averaging predicted class probabilities (soft voting).<br>
**LogisticRegression:** Linear classification model that estimates class probabilities with the logistic function.<br>
**train_test_split:** Function for splitting datasets into training and testing sets.<br>
**SVC:** Support Vector Classifier for classification tasks.<br>
Together, these sklearn imports cover data preprocessing, model creation, ensembling, and train/test splitting.
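
As a rough illustration of how the pipeline and column helpers listed above fit together, here is a minimal sketch on a hypothetical toy frame. The column names `region` and `share`, the target values, and the choice of median imputation are assumptions made for the example, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one categorical and one numeric feature with a gap.
toy = pd.DataFrame({
    "region": ["Rural", "Urban", "Rural", "Urban", "Rural", "Urban"],
    "share": [0.45, 0.52, np.nan, 0.61, 0.40, 0.58],
})
target = np.array([0, 1, 0, 1, 0, 1])

preprocess = make_column_transformer(
    # Numeric columns: fill gaps with the median, then standardize.
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     make_column_selector(dtype_include=np.number)),
    # Non-numeric columns: one-hot encode, ignoring unseen categories.
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
)

# Preprocessing and the classifier are fitted together as one estimator.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(toy, target)
toy_pred = model.predict(toy)
```

Bundling the preprocessing into the pipeline means the same imputation, scaling, and encoding are applied at prediction time without extra bookkeeping.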

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [29]:
disabled = pd.read_csv('India-disability-data(cleaned).csv')
df = disabled.copy()
df.head()

Unnamed: 0,State Code,Area Name,Total/ Rural/Urban,Disability,Age-group,Total disabled population - Persons,Total disabled population - Males,Total disabled population - Females,Main worker - Persons,Main worker - Males,Main worker - Females,Non-worker - Persons,Non-worker - Males,Non-worker - Females
0,0,INDIA,Total,Total,Total,26814994,14988593,11826401,6982009,5464857,1517152,17070608,7915768,9154840
1,0,INDIA,Total,Total,0-14,5572336,3073214,2499122,100779,61870,38909,5344297,2942702,2401595
2,0,INDIA,Total,Total,15-59,15728243,9125226,6603017,5808809,4559220,1249589,7785245,3317990,4467255
3,0,INDIA,Total,Total,60+,5376619,2713995,2662624,1036384,816764,219620,3854887,1614909,2239978
4,0,INDIA,Total,Total,Age not stated,137796,76158,61638,36037,27003,9034,86179,40167,46012


In [30]:
df.columns

Index(['State Code', 'Area Name', 'Total/ Rural/Urban', 'Disability',
       'Age-group', 'Total disabled population - Persons',
       'Total disabled population - Males',
       'Total disabled population - Females', 'Main worker - Persons',
       'Main worker - Males', 'Main worker - Females', 'Non-worker - Persons',
       'Non-worker - Males', 'Non-worker - Females'],
      dtype='object')


This code drops the national aggregate rows (**'Area Name'** equal to **'INDIA'**) and every row where **'Total/ Rural/Urban'**, **'Disability'**, or **'Age-group'** equals **'Total'**. Removing these aggregated entries leaves only rows for a specific state, area type, disability, and age group, so totals are not mixed with the categories they sum over.
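
The same filter can also be written as a single boolean mask. The sketch below uses a hypothetical four-row miniature of the table; the row values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the census table.
toy = pd.DataFrame({
    "Area Name": ["INDIA", "State - A (01)", "State - A (01)", "State - A (01)"],
    "Total/ Rural/Urban": ["Total", "Total", "Rural", "Rural"],
    "Disability": ["Total", "In-Seeing", "Total", "In-Seeing"],
    "Age-group": ["Total", "0-14", "0-14", "0-14"],
})

total_cols = ["Total/ Rural/Urban", "Disability", "Age-group"]
# Keep rows that are not the national aggregate and have no 'Total'
# in any of the three breakdown columns.
keep = (toy["Area Name"] != "INDIA") & ~toy[total_cols].eq("Total").any(axis=1)
filtered = toy[keep]
```

Only the last row survives here, since every other row is an aggregate along at least one dimension.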

In [31]:
df = df[df['Area Name'] != 'INDIA'] 
df = df[df['Total/ Rural/Urban'] != 'Total']
df = df[df['Disability'] != 'Total']
df = df[df['Age-group'] != 'Total']
df.head()

Unnamed: 0,State Code,Area Name,Total/ Rural/Urban,Disability,Age-group,Total disabled population - Persons,Total disabled population - Males,Total disabled population - Females,Main worker - Persons,Main worker - Males,Main worker - Females,Non-worker - Persons,Non-worker - Males,Non-worker - Females
186,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,0-14,10502,5720,4782,73,44,29,10027,5442,4585
187,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,15-59,23093,12785,10308,6090,5450,640,11035,3807,7228
188,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,60+,16496,8482,8014,1953,1779,174,12186,5141,7045
189,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,Age not stated,51,32,19,8,8,0,38,22,16
191,1,State - JAMMU & KASHMIR (01),Rural,In-Hearing,0-14,12461,6652,5809,122,91,31,11994,6393,5601


**df.reset_index(drop=True, inplace=True)** renumbers the rows of df after the filtering above. `drop=True` discards the old index instead of keeping it as a column, and `inplace=True` modifies df directly. The new index starts at zero and increases by one per row.
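
A small sketch of the effect on a hypothetical toy frame (the `value` column is invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40]})

# Filtering keeps the original row labels, so the index has a gap at 0.
kept = toy[toy["value"] > 15]

# drop=True discards the old labels instead of storing them in a column.
renumbered = kept.reset_index(drop=True)
```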

In [32]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,State Code,Area Name,Total/ Rural/Urban,Disability,Age-group,Total disabled population - Persons,Total disabled population - Males,Total disabled population - Females,Main worker - Persons,Main worker - Males,Main worker - Females,Non-worker - Persons,Non-worker - Males,Non-worker - Females
0,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,0-14,10502,5720,4782,73,44,29,10027,5442,4585
1,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,15-59,23093,12785,10308,6090,5450,640,11035,3807,7228
2,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,60+,16496,8482,8014,1953,1779,174,12186,5141,7045
3,1,State - JAMMU & KASHMIR (01),Rural,In-Seeing,Age not stated,51,32,19,8,8,0,38,22,16
4,1,State - JAMMU & KASHMIR (01),Rural,In-Hearing,0-14,12461,6652,5809,122,91,31,11994,6393,5601
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,35,State - ANDAMAN & NICOBAR ISLANDS (35),Urban,Any-Other,Age not stated,0,0,0,0,0,0,0,0,0
2236,35,State - ANDAMAN & NICOBAR ISLANDS (35),Urban,Multiple-Disability,0-14,60,37,23,1,1,0,59,36,23
2237,35,State - ANDAMAN & NICOBAR ISLANDS (35),Urban,Multiple-Disability,15-59,88,53,35,15,15,0,71,37,34
2238,35,State - ANDAMAN & NICOBAR ISLANDS (35),Urban,Multiple-Disability,60+,17,8,9,2,2,0,15,6,9


In [34]:
# df = df.drop('Total disabled population - Persons', axis=1)
df = df.drop('State Code', axis=1)
df

Unnamed: 0,Area Name,Total/ Rural/Urban,Disability,Age-group,Total disabled population - Persons,Total disabled population - Males,Total disabled population - Females,Main worker - Persons,Main worker - Males,Main worker - Females,Non-worker - Persons,Non-worker - Males,Non-worker - Females
0,JAMMU & KASHMI,Rural,In-Seeing,0-14,10502,5720,4782,73,44,29,10027,5442,4585
1,JAMMU & KASHMI,Rural,In-Seeing,15-59,23093,12785,10308,6090,5450,640,11035,3807,7228
2,JAMMU & KASHMI,Rural,In-Seeing,60+,16496,8482,8014,1953,1779,174,12186,5141,7045
3,JAMMU & KASHMI,Rural,In-Seeing,Age not stated,51,32,19,8,8,0,38,22,16
4,JAMMU & KASHMI,Rural,In-Hearing,0-14,12461,6652,5809,122,91,31,11994,6393,5601
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,ANDAMAN & NICOBAR ISLAND,Urban,Any-Other,Age not stated,0,0,0,0,0,0,0,0,0
2236,ANDAMAN & NICOBAR ISLAND,Urban,Multiple-Disability,0-14,60,37,23,1,1,0,59,36,23
2237,ANDAMAN & NICOBAR ISLAND,Urban,Multiple-Disability,15-59,88,53,35,15,15,0,71,37,34
2238,ANDAMAN & NICOBAR ISLAND,Urban,Multiple-Disability,60+,17,8,9,2,2,0,15,6,9
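
The area names in this output have lost the `State - ` prefix and the `(NN)` code, but also their final letter (`JAMMU & KASHMI` instead of `JAMMU & KASHMIR`). The cell that did this is not shown, so the following is only a sketch of one way to strip the prefix and code without touching the name, assuming the raw values follow the `State - NAME (NN)` pattern seen in the earlier output.

```python
import pandas as pd

raw = pd.Series([
    "State - JAMMU & KASHMIR (01)",
    "State - ANDAMAN & NICOBAR ISLANDS (35)",
])

# Drop the leading 'State - ' and the trailing ' (NN)' code, nothing else.
clean = (
    raw.str.replace(r"^State - ", "", regex=True)
       .str.replace(r"\s*\(\d+\)$", "", regex=True)
)
```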


In [35]:
# df['Total disabled population - Males'] = df['Total disabled population - Males']/df['Total disabled population - Persons']
df['Total disabled population - Females'] = df['Total disabled population - Females']/df['Total disabled population - Persons']
df['Main worker - Males'] = df['Main worker - Males']/df['Total disabled population - Persons']
df['Main worker - Females'] = df['Main worker - Females']/df['Total disabled population - Persons']
df['Non-worker - Males'] = df['Non-worker - Males']/df['Total disabled population - Persons']
df['Non-worker - Females'] = df['Non-worker - Females']/df['Total disabled population - Persons']



df

Unnamed: 0,Area Name,Total/ Rural/Urban,Disability,Age-group,Total disabled population - Persons,Total disabled population - Males,Total disabled population - Females,Main worker - Persons,Main worker - Males,Main worker - Females,Non-worker - Persons,Non-worker - Males,Non-worker - Females
0,JAMMU & KASHMI,Rural,In-Seeing,0-14,10502,5720,0.455342,73,0.004190,0.002761,10027,0.518187,0.436584
1,JAMMU & KASHMI,Rural,In-Seeing,15-59,23093,12785,0.446369,6090,0.236002,0.027714,11035,0.164855,0.312995
2,JAMMU & KASHMI,Rural,In-Seeing,60+,16496,8482,0.485815,1953,0.107844,0.010548,12186,0.311651,0.427073
3,JAMMU & KASHMI,Rural,In-Seeing,Age not stated,51,32,0.372549,8,0.156863,0.000000,38,0.431373,0.313725
4,JAMMU & KASHMI,Rural,In-Hearing,0-14,12461,6652,0.466174,122,0.007303,0.002488,11994,0.513041,0.449482
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,ANDAMAN & NICOBAR ISLAND,Urban,Any-Other,Age not stated,0,0,,0,,,0,,
2236,ANDAMAN & NICOBAR ISLAND,Urban,Multiple-Disability,0-14,60,37,0.383333,1,0.016667,0.000000,59,0.600000,0.383333
2237,ANDAMAN & NICOBAR ISLAND,Urban,Multiple-Disability,15-59,88,53,0.397727,15,0.170455,0.000000,71,0.420455,0.386364
2238,ANDAMAN & NICOBAR ISLAND,Urban,Multiple-Disability,60+,17,8,0.529412,2,0.117647,0.000000,15,0.352941,0.529412


In [36]:
labels = df['Disability']
df.drop('Disability', axis=1, inplace = True)
df

Unnamed: 0,Area Name,Total/ Rural/Urban,Age-group,Total disabled population - Persons,Total disabled population - Males,Total disabled population - Females,Main worker - Persons,Main worker - Males,Main worker - Females,Non-worker - Persons,Non-worker - Males,Non-worker - Females
0,JAMMU & KASHMI,Rural,0-14,10502,5720,0.455342,73,0.004190,0.002761,10027,0.518187,0.436584
1,JAMMU & KASHMI,Rural,15-59,23093,12785,0.446369,6090,0.236002,0.027714,11035,0.164855,0.312995
2,JAMMU & KASHMI,Rural,60+,16496,8482,0.485815,1953,0.107844,0.010548,12186,0.311651,0.427073
3,JAMMU & KASHMI,Rural,Age not stated,51,32,0.372549,8,0.156863,0.000000,38,0.431373,0.313725
4,JAMMU & KASHMI,Rural,0-14,12461,6652,0.466174,122,0.007303,0.002488,11994,0.513041,0.449482
...,...,...,...,...,...,...,...,...,...,...,...,...
2235,ANDAMAN & NICOBAR ISLAND,Urban,Age not stated,0,0,,0,,,0,,
2236,ANDAMAN & NICOBAR ISLAND,Urban,0-14,60,37,0.383333,1,0.016667,0.000000,59,0.600000,0.383333
2237,ANDAMAN & NICOBAR ISLAND,Urban,15-59,88,53,0.397727,15,0.170455,0.000000,71,0.420455,0.386364
2238,ANDAMAN & NICOBAR ISLAND,Urban,60+,17,8,0.529412,2,0.117647,0.000000,15,0.352941,0.529412


In [37]:
df.isna().sum()

Area Name                                0
Total/ Rural/Urban                       0
Age-group                                0
Total disabled population - Persons      0
Total disabled population - Males        0
Total disabled population - Females    110
Main worker - Persons                    0
Main worker - Males                    110
Main worker - Females                  110
Non-worker - Persons                     0
Non-worker - Males                     110
Non-worker - Females                   110
dtype: int64
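
The 110 missing values appear only in the columns that were divided by 'Total disabled population - Persons', which is consistent with rows where that total is 0 (for example the 'Age not stated' row with all zeros shown earlier). A minimal sketch of where the NaN comes from and one possible way to handle it; filling with 0 is an assumption for illustration, not the notebook's choice.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Total disabled population - Persons": [10502, 0],
    "Total disabled population - Females": [4782, 0],
})
persons = toy["Total disabled population - Persons"]

# 0 / 0 evaluates to NaN in pandas, which produces the missing values.
share = toy["Total disabled population - Females"] / persons

# One option (an assumption): treat groups with no people as a share of 0.
share_filled = share.where(persons > 0, 0.0)
```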

In [38]:
# simpleImpute = SimpleImputer(strategy='median')
# df = simpleImpute.fit_transform(df)
# df

In [39]:
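# One-hot encode the three categorical columns (area, rural/urban, age group).
cat_coder = OneHotEncoder()
cat_coder.fit(df.iloc[:, :3])
a = cat_coder.transform(df.iloc[:, :3])  # SciPy sparse matrix
cat_coder.categories_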

[array(['ANDAMAN & NICOBAR ISLAND', 'ANDHRA PRADES', 'ARUNACHAL PRADES',
        'ASSA', 'BIHA', 'CHANDIGAR', 'CHHATTISGAR', 'DADRA & NAGAR HAVEL',
        'DAMAN & DI', 'GO', 'GUJARA', 'HARYAN', 'HIMACHAL PRADES',
        'JAMMU & KASHMI', 'JHARKHAN', 'KARNATAK', 'KERAL', 'LAKSHADWEE',
        'MADHYA PRADES', 'MAHARASHTR', 'MANIPU', 'MEGHALAY', 'MIZORA',
        'NAGALAN', 'NCT OF DELH', 'ODISH', 'PUDUCHERR', 'PUNJA',
        'RAJASTHA', 'SIKKI', 'TAMIL NAD', 'TRIPUR', 'UTTAR PRADES',
        'UTTARAKHAN', 'WEST BENGA'], dtype=object),
 array(['Rural', 'Urban'], dtype=object),
 array(['0-14', '15-59', '60+', 'Age not stated'], dtype=object)]

In [40]:
cat_coder.get_feature_names_out()

array(['Area Name_ANDAMAN & NICOBAR ISLAND', 'Area Name_ANDHRA PRADES',
       'Area Name_ARUNACHAL PRADES', 'Area Name_ASSA', 'Area Name_BIHA',
       'Area Name_CHANDIGAR', 'Area Name_CHHATTISGAR',
       'Area Name_DADRA & NAGAR HAVEL', 'Area Name_DAMAN & DI',
       'Area Name_GO', 'Area Name_GUJARA', 'Area Name_HARYAN',
       'Area Name_HIMACHAL PRADES', 'Area Name_JAMMU & KASHMI',
       'Area Name_JHARKHAN', 'Area Name_KARNATAK', 'Area Name_KERAL',
       'Area Name_LAKSHADWEE', 'Area Name_MADHYA PRADES',
       'Area Name_MAHARASHTR', 'Area Name_MANIPU', 'Area Name_MEGHALAY',
       'Area Name_MIZORA', 'Area Name_NAGALAN', 'Area Name_NCT OF DELH',
       'Area Name_ODISH', 'Area Name_PUDUCHERR', 'Area Name_PUNJA',
       'Area Name_RAJASTHA', 'Area Name_SIKKI', 'Area Name_TAMIL NAD',
       'Area Name_TRIPUR', 'Area Name_UTTAR PRADES',
       'Area Name_UTTARAKHAN', 'Area Name_WEST BENGA',
       'Total/ Rural/Urban_Rural', 'Total/ Rural/Urban_Urban',
       'Age-group_0-14', 

In [41]:
class AreaCoder:
    def __init__(self):
        self.n_features_in_ = 10  # Assuming a value of 10 for demonstration purposes

# Creating an instance of the AreaCoder class
area_coder = AreaCoder()

# Accessing the n_features_in_ attribute
print(area_coder.n_features_in_)  # This should print the value assigned (e.g., 10)


10


In [42]:
encoded_data = pd.DataFrame(a.toarray(), columns=cat_coder.get_feature_names_out())

In [43]:
a.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 0., 1.]])

In [44]:
b = cat_coder.inverse_transform(a)
b

array([['JAMMU & KASHMI', 'Rural', '0-14'],
       ['JAMMU & KASHMI', 'Rural', '15-59'],
       ['JAMMU & KASHMI', 'Rural', '60+'],
       ...,
       ['ANDAMAN & NICOBAR ISLAND', 'Urban', '15-59'],
       ['ANDAMAN & NICOBAR ISLAND', 'Urban', '60+'],
       ['ANDAMAN & NICOBAR ISLAND', 'Urban', 'Age not stated']],
      dtype=object)

In [45]:
encoded_data.columns

Index(['Area Name_ANDAMAN & NICOBAR ISLAND', 'Area Name_ANDHRA PRADES',
       'Area Name_ARUNACHAL PRADES', 'Area Name_ASSA', 'Area Name_BIHA',
       'Area Name_CHANDIGAR', 'Area Name_CHHATTISGAR',
       'Area Name_DADRA & NAGAR HAVEL', 'Area Name_DAMAN & DI', 'Area Name_GO',
       'Area Name_GUJARA', 'Area Name_HARYAN', 'Area Name_HIMACHAL PRADES',
       'Area Name_JAMMU & KASHMI', 'Area Name_JHARKHAN', 'Area Name_KARNATAK',
       'Area Name_KERAL', 'Area Name_LAKSHADWEE', 'Area Name_MADHYA PRADES',
       'Area Name_MAHARASHTR', 'Area Name_MANIPU', 'Area Name_MEGHALAY',
       'Area Name_MIZORA', 'Area Name_NAGALAN', 'Area Name_NCT OF DELH',
       'Area Name_ODISH', 'Area Name_PUDUCHERR', 'Area Name_PUNJA',
       'Area Name_RAJASTHA', 'Area Name_SIKKI', 'Area Name_TAMIL NAD',
       'Area Name_TRIPUR', 'Area Name_UTTAR PRADES', 'Area Name_UTTARAKHAN',
       'Area Name_WEST BENGA', 'Total/ Rural/Urban_Rural',
       'Total/ Rural/Urban_Urban', 'Age-group_0-14', 'Age-group_15-

In [46]:
# encoded_data.isnull().sum()
labels

0                 In-Seeing
1                 In-Seeing
2                 In-Seeing
3                 In-Seeing
4                In-Hearing
               ...         
2235              Any-Other
2236    Multiple-Disability
2237    Multiple-Disability
2238    Multiple-Disability
2239    Multiple-Disability
Name: Disability, Length: 2240, dtype: object

# Voting Classifier

In [47]:
X = encoded_data.copy().values
y = labels.values

**train_test_split(X, y, random_state=42)**<br>
Splits the features (X) and target labels (y) into training and test sets. With no `test_size` given, 25% of the rows go to the test set; `random_state=42` fixes the shuffle so the split is reproducible.

**VotingClassifier**<br>
Creates an ensemble classifier that combines the predictions of several base estimators. The estimators used here are:

**Logistic Regression ('lr'):** A linear classification model.<br>
**Random Forest ('rf'):** An ensemble of decision trees.<br>
**Support Vector Classifier ('svc'):** A classifier based on support vector machines.

**voting_clf.fit(X_train, y_train)**<br>
Trains the VotingClassifier on the training data (X_train, y_train). Each base estimator is cloned and fitted on that data. Because no `voting` argument is passed, the default hard voting applies: the ensemble predicts the class chosen by most base estimators.

Aggregating several different models in this way can improve on any single model, provided the models are reasonably accurate and do not all make the same errors.
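
To make the two voting modes concrete, here is a minimal sketch on synthetic data from `make_classification`. The dataset, the helper name `build`, and the `max_iter=1000` setting are assumptions for the example; the notebook itself uses the default hard voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the real data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)


def build(voting):
    """Hypothetical helper: same three estimators, different voting mode."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000, random_state=42)),
            ("rf", RandomForestClassifier(random_state=42)),
            # probability=True lets SVC supply the probabilities soft voting needs.
            ("svc", SVC(probability=True, random_state=42)),
        ],
        voting=voting,
    )


hard_clf = build("hard").fit(Xtr, ytr)  # majority vote over predicted labels
soft_clf = build("soft").fit(Xtr, ytr)  # argmax of averaged class probabilities

hard_acc = hard_clf.score(Xte, yte)
soft_acc = soft_clf.score(Xte, yte)
soft_proba = soft_clf.predict_proba(Xte)
```

Soft voting also exposes `predict_proba`, which probability-based metrics require; hard voting does not.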

In [73]:

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

**for name, clf in voting_clf.named_estimators_.items():**<br>
Iterates over the fitted base estimators inside the VotingClassifier. `named_estimators_` maps each assigned name to its fitted estimator.

**clf.score(X_test, y_test)**<br>
Returns the accuracy of each base estimator on the test data, that is, the fraction of test samples whose predicted label matches the true label.

**print(name, "=", clf.score(X_test, y_test))**<br>
Prints each estimator's name next to its accuracy so the three models can be compared on the same test set. Note that VotingClassifier fits its base estimators on integer-encoded labels, so comparing their predictions with the string labels in y_test yields 0.0 for every model, as the output shows. That figure reflects the label mismatch rather than the quality of the models.

In [75]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.0
rf = 0.0
svc = 0.0
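
All three scores are exactly 0.0 because the fitted base estimators predict integer-encoded classes while `y_test` holds strings. A minimal sketch of scoring them on matching labels, using the ensemble's own `le_` encoder; the synthetic data and the string labels below are stand-ins chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_int = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
# Hypothetical string labels standing in for the disability categories.
y_str = np.array(["In-Seeing", "In-Hearing", "In-Speech"])[y_int]
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_str, random_state=42)

demo_clf = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
]).fit(Xtr, ytr)

# The base estimators were trained on integer codes, so encode the string
# test labels with the ensemble's LabelEncoder before scoring them.
yte_encoded = demo_clf.le_.transform(yte)
scores = {
    name: clf.score(Xte, yte_encoded)
    for name, clf in demo_clf.named_estimators_.items()
}
```

In the notebook, the equivalent change would be to pass `voting_clf.le_.transform(y_test)` instead of `y_test` inside the loop.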


**voting_clf.predict(X_test[:10])**<br>
Generates predictions with the VotingClassifier (voting_clf) for the first 10 samples of the test set **(X_test[:10])**.

The result is an array with the predicted class label for each of these 10 samples, decided by majority vote among the base estimators. The ensemble returns the original string labels, as the output shows.

In [76]:
voting_clf.predict(X_test[:10])

array(['Any-Other', 'Mental-Illness', 'In-Movement', 'Any-Other',
       'Mental-Illness', 'In-Seeing', 'In-Speech', 'Any-Other',
       'In-Seeing', 'Any-Other'], dtype=object)

In [77]:
ypred=voting_clf.predict(X_test)

In [78]:
from sklearn.metrics import accuracy_score as ac

In [83]:
print(ac(y_test, ypred))  # accuracy_score takes (y_true, y_pred)

0.014285714285714285
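
An accuracy of about 0.014 is far below what random guessing among the disability categories would give. One possible explanation, inferred from the row counts rather than stated in the notebook: 2240 rows spread over 35 × 2 × 4 = 280 combinations of area, rural/urban, and age group gives 8 rows per combination, which suggests each combination appears once per disability type. In that layout the one-hot features carry no information about the label, and a held-out row's label never occurs in training for the same combination. The sketch below builds a hypothetical miniature of such a fully crossed table and checks both properties.

```python
import itertools

import pandas as pd

# Hypothetical miniature of a fully crossed layout.
areas = ["A", "B", "C"]
regions = ["Rural", "Urban"]
ages = ["0-14", "15-59", "60+"]
kinds = ["In-Seeing", "In-Hearing", "In-Speech", "Any-Other"]

design = pd.DataFrame(
    list(itertools.product(areas, regions, ages, kinds)),
    columns=["Area Name", "Total/ Rural/Urban", "Age-group", "Disability"],
)

# Every feature combination occurs exactly once per disability label ...
per_combo = design.groupby(
    ["Area Name", "Total/ Rural/Urban", "Age-group"]
)["Disability"].nunique()

# ... so the label distribution is identical within every age group.
by_age = pd.crosstab(design["Age-group"], design["Disability"], normalize="index")
```

On the real data, `pd.crosstab(df['Age-group'], labels)` would show whether the same pattern holds.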


In [81]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Assuming you have a trained voting classifier model named 'voting_clf' and test data 'X_test', 'y_test'
y_pred = voting_clf.predict(X_test)

# Precision (example with 'macro' averaging)
precision = precision_score(y_test, y_pred, average='macro')

# Recall (example with 'macro' averaging)
recall = recall_score(y_test, y_pred, average='macro')

# F1 Score (example with 'macro' averaging)
f1 = f1_score(y_test, y_pred, average='macro')

# roc_auc_score expects class probabilities (or decision scores), not hard labels.
# For multiclass targets it also needs multi_class='ovr' or 'ovo'.
roc_auc = roc_auc_score(y_test, y_pred)  # Raises ValueError: y_pred holds string labels

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score (binary):", roc_auc)  # For multiclass, consider other strategies or metrics
print("Confusion Matrix:\n", conf_matrix)


ValueError: could not convert string to float: 'Any-Other'
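
The error arises because `roc_auc_score` received hard string predictions where it expects scores. A minimal sketch of a multiclass ROC-AUC using soft voting and the one-vs-rest strategy follows. The synthetic data and string labels are stand-ins, and the ensemble is reduced to two estimators to keep the example short; including `SVC` would require `probability=True`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_int = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
# Hypothetical string labels standing in for the disability categories.
y_str = np.array(["In-Seeing", "In-Hearing", "In-Speech"])[y_int]
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_str, random_state=42, stratify=y_str
)

# Soft voting is required so the ensemble exposes predict_proba.
soft_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
).fit(Xtr, ytr)

# One probability column per class, ordered like soft_clf.classes_.
proba = soft_clf.predict_proba(Xte)
roc_auc = roc_auc_score(yte, proba, multi_class="ovr", labels=soft_clf.classes_)
```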