# Census Income

Project Description
This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
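The extraction conditions above can be read as a row filter. The sketch below assumes a hypothetical raw table named `raw` whose columns carry the variable names from the condition (`AAGE`, `AGI`, `AFNLWGT`, `HRSWK`). The values are invented for illustration and are not taken from the census file.

```python
import pandas as pd

# Hypothetical raw records; column names follow the extraction condition,
# values are made up for illustration only.
raw = pd.DataFrame({
    "AAGE":    [15, 25, 40, 60, 38],
    "AGI":     [500, 50, 30000, 45000, 40000],
    "AFNLWGT": [1200, 3400, 1, 2500, 2000],
    "HRSWK":   [10, 40, 40, 0, 40],
})

# Keep rows that satisfy all four conditions:
# (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask]
print(clean)
```

Each of the first four toy rows fails exactly one condition, so only the last row survives the filter.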

Description of fnlwgt (final weight)

The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian non-institutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:
1.	A single cell estimate of the population 16+ for each state.
2.	Controls for Hispanic Origin by age and sex.
3.	Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
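Raking is commonly implemented as iterative proportional fitting: weights are rescaled to match one set of control totals, then the next, and the cycle is repeated. The sketch below is a simplified illustration with only two margins and invented numbers. The actual CPS procedure uses the three control sets listed above, and the function name `rake` and all totals here are assumptions.

```python
import numpy as np

def rake(table, row_targets, col_targets, passes=6):
    """Iterative proportional fitting on a 2-D table of weighted tallies."""
    w = np.asarray(table, dtype=float).copy()
    for _ in range(passes):
        # Rescale each row so the row sums match the row control totals.
        w *= (row_targets / w.sum(axis=1))[:, None]
        # Rescale each column so the column sums match the column control totals.
        w *= (col_targets / w.sum(axis=0))[None, :]
    return w

# Hypothetical starting tallies (for example, sex by age group) and controls.
start = np.array([[30.0, 20.0],
                  [25.0, 25.0]])
row_targets = np.array([60.0, 40.0])
col_targets = np.array([55.0, 45.0])

raked = rake(start, row_targets, col_targets)
print(raked.sum(axis=1))  # row sums, close to row_targets
print(raked.sum(axis=0))  # column sums, matched on the last step
```

Because each pass disturbs the margin fitted in the previous step only slightly, a handful of passes is typically enough for both sets of totals to agree with the controls.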

Dataset Link-
•	https://github.com/FlipRoboTechnologies/ML_-Datasets/blob/main/Census%20Income/Census%20Income.csv


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
census_data_url = 'https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Census%20Income/Census%20Income.csv'

df_Census =pd.read_csv(census_data_url)

df_Census.head()

HTTPError: HTTP Error 404: Not Found

In [None]:
df_Census.info()

In [None]:
df_Census.describe()

The summary statistics above suggest there are no missing values. Let's still check explicitly for missing and duplicate values.

In [None]:
df_Census.isnull().sum()

In [None]:
df_Census.duplicated().sum()

In [None]:
#View the duplicated rows
#Display the duplicated rows (by default the first occurrence of each is not flagged)


duplicated_rows = df_Census[df_Census.duplicated()]
duplicated_rows

In [None]:
df_Census.shape

In [None]:
#drop_duplicates(inplace=True) modifies df_Census and returns None, so the result is not assigned
df_Census.drop_duplicates(inplace=True)
df_Census.shape

In [None]:
df_Census.isnull().sum()

In [None]:
df_Census.duplicated().sum()

There are now no duplicates and no missing values. Next, we look at the summary statistics again with df_Census.describe().

In [None]:
df_Census.describe()

These statistics help in understanding the distribution and central tendency of the data, which is useful for further analysis and decision making.



The majority of values for Capital_gain and Capital_loss are 0 (zero), which indicates that most individuals in the dataset did not report any capital gains or losses.

This could be due to several reasons:

1.	Economic behavior: many people do not engage in activities that result in capital gains or losses, such as trading stocks or selling assets.
2.	Income level: individuals with lower incomes might not have the financial capacity to invest in assets that generate capital gains or losses.
3.	Tax reporting, dataset composition and economic conditions.



**Analyze Zero Values**




In [None]:
zero_capital_gain_count = df_Census[df_Census['Capital_gain'] == 0].shape[0]
zero_capital_loss_count = df_Census[df_Census['Capital_loss'] == 0].shape[0]
print("Number of rows with zero capital gain:", zero_capital_gain_count)
print("Number of rows with zero capital loss:", zero_capital_loss_count)

In [None]:
total_rows = df_Census.shape[0]
total_rows

In [None]:
zero_capital_gain_count = df_Census[df_Census['Capital_gain'] == 0].shape[0]
zero_capital_loss_count = df_Census[df_Census['Capital_loss'] == 0].shape[0]

zero_capital_gain_percentage = (zero_capital_gain_count / total_rows) * 100
zero_capital_loss_percentage = (zero_capital_loss_count / total_rows) * 100
print("Percentage of rows with zero capital gain:", zero_capital_gain_percentage)
print("Percentage of rows with zero capital loss:", zero_capital_loss_percentage)

#Percentage of rows with zero capital gain: 91.66769117285469
#Percentage of rows with zero capital loss: 95.33132530120481

In [None]:
#additional analysis
#correlation between final weight and capital gain and loss

corr_gain = df_Census[['Fnlwgt', 'Capital_gain']].corr()


print(f"Correlation between final weight and capital gain:\n{corr_gain}")


#The correlation between Fnlwgt (final weight) and Capital_gain is close to zero (0.000433), so there is almost no linear relationship between the two variables.

Correlation between final weight and capital gain:
                Fnlwgt  Capital_gain
Fnlwgt        1.000000      0.000433
Capital_gain  0.000433      1.000000


In [None]:
df_Census

In [None]:
corr_loss = df_Census[['Fnlwgt','Capital_loss']].corr()
print(f"Correlation between final weight and capital loss:\n{corr_loss}")

The correlation matrix between **Fnlwgt (final weight)** and **Capital_loss** shows a very **weak negative correlation** (-0.010267), which suggests that there is **almost no linear relationship between these two variables**.

This is consistent with the near-zero correlation (0.000433) seen earlier between Fnlwgt and Capital_gain.

In [None]:
df_Census.hist(figsize=(10,10))
plt.show()

In [None]:
#standardize the numerical columns (preview only; df_Census itself is not modified in this cell)
from sklearn.preprocessing import StandardScaler
num_cols = df_Census.select_dtypes(include=['number'])
scaler = StandardScaler()
num_cols_scaled = scaler.fit_transform(num_cols)
num_cols

In [None]:
#standardize the numerical columns
from sklearn.preprocessing import StandardScaler
num_cols = df_Census.select_dtypes(include=['number'])
scaler = StandardScaler()
df_Census[num_cols.columns] = scaler.fit_transform(num_cols)
df_Census.head()

In [None]:
#one hot encode the categorical features
df_Census = pd.get_dummies(df_Census, columns=['Workclass', 'Education', 'Marital_status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Native_country'])
df_Census.head()


In [None]:
#encode the target feature
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_Census_enc = df_Census.copy()
df_Census_enc['Income'] = label_encoder.fit_transform(df_Census['Income'])
df_Census_enc.head()

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x='Education_num',y='Capital_gain',data=df_Census)
plt.title('Education_num vs Capital_gain')
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(y='Education_num',x='Income',data=df_Census)
plt.title('Education_num vs Income')
plt.show()

In [None]:
#Correlation Matrix

numerical_df_Census = df_Census .select_dtypes(include=['int64', 'float64'])

corr_matrix = numerical_df_Census .corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix Census ')
plt.show()

As per the correlation heatmap above:
1.	Age: weak negative correlation with Fnlwgt; weak positive correlation with Education_num, Capital_gain, Capital_loss and hours per week.


Most pairs of features show a weak positive correlation, and a few show a weak negative correlation.



In [None]:
#analyze the capital gain and capital loss distributions by income

plt.figure(figsize=(10, 6))
plt.subplot(1,2,1)
sns.histplot(df_Census,x = 'Capital_gain',hue='Income',multiple='stack', kde =True)
plt.title('Capital_gain Distribution by Income ')

plt.subplot(1,2,2)
sns.histplot(data = df_Census,x='Capital_loss', hue='Income',multiple='stack', kde =True)
plt.title('Capital_loss Distribution by Income')
plt.show()

In [None]:
incom_analysis = df_Census.groupby('Income')[['Capital_gain','Capital_loss']].describe()
incom_analysis

In [None]:
#split the dataset into x and y
y = df_Census_enc['Income']
df_Census_enc.drop('Income', axis=1, inplace=True)
X = df_Census_enc

**Explain a Logistic Regression Model using Coefficients**
We will focus on explaining a logistic regression model.

 A logistic regression model is intrinsically interpretable because you can explain the model by looking at its coefficients.

 Coefficients with a larger absolute value indicate a stronger influence on the target, provided the features are on a comparable scale (the numerical columns were standardized earlier).

 Positive coefficients increase the predicted probability of the target class, and negative coefficients decrease it.

**We will do the following**

Instantiate the logistic regression class and fit it to the data.

To explain the model, extract and plot the coefficients. For simplicity, plot only the top ten coefficients.
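As a minimal sketch of how coefficients can be read, the example below fits a logistic regression on synthetic data, not on the census data. The names `X_demo`, `y_demo` and `demo_model` are hypothetical, and the true effects (positive for feature 0, negative for feature 1, none for feature 2) are chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic features on a standardized scale (hypothetical data).
X_demo = rng.normal(size=(2000, 3))
# The target depends positively on feature 0, negatively on feature 1,
# and not at all on feature 2.
logits = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)
coefs = demo_model.coef_[0]

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(coefs)

# Rank features by absolute coefficient, largest first.
order = np.argsort(np.abs(coefs))[::-1]
for i in order:
    print(f"feature_{i}: coef={coefs[i]:+.2f}, odds ratio={odds_ratios[i]:.2f}")
```

Ranking by absolute value keeps strongly negative features in view; ranking by raw value would show only the positive ones.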

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
#Split the data before fitting so the test set stays unseen by the model
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
#Fit on the training split only
Lr_model = LogisticRegression()
Lr_model.fit(X_train,y_train)

In [None]:
y_pred = Lr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:
#Confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

#Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)


print("Coefficients:\n",Lr_model.coef_)

In [None]:
#plot the top 10 coefficients by absolute value
coefficients = Lr_model.coef_[0]

feature_names = list(X.columns)
coef_feature_pairs = list(zip(coefficients, feature_names))
top_coefficients = sorted(coef_feature_pairs, key=lambda x: abs(x[0]), reverse=True)[:10]
#keep names and values separate so the bar heights are the coefficients
top_names = [name for _, name in top_coefficients]
top_values = [coef for coef, _ in top_coefficients]
plt.figure(figsize=(10, 6))
plt.bar(top_names, top_values)
plt.xticks(rotation=90)
plt.title('Top 10 Coefficients')
plt.show()


In [None]:
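from sklearn.metrics import roc_curve, auc

#use predicted probabilities for the positive class rather than hard labels
y_scores = Lr_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
roc_auc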


In [None]:
#random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

Rf_model = RandomForestClassifier()
Rf_model.fit(X_train,y_train)
y_pred = Rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

importances = Rf_model.feature_importances_
feature_imp_df =  pd.DataFrame({'Feature': X.columns, 'Importance': importances})

feature_imp_df = feature_imp_df.sort_values(by='Importance', ascending=False)
top_n =10
top_features = feature_imp_df.head(top_n)
plt.figure(figsize=(10, 6))
plt.bar(top_features['Feature'], top_features['Importance'])
plt.xticks(rotation=90)
plt.title(f'Top {top_n} Features')
plt.show()

