# Homework-3

**Waheeb Algabri, Joe Garcia, Lwin Shwe, Mikhail Broomes**

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from scipy import stats
import itertools
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc


## Homework 3 - Logistic Regression

### Analysis of Factors that Influence Crime Rates:

In this assignment we explore, analyze, and model a dataset describing crime in the neighborhoods of a major city. Each record includes a variable indicating whether the neighborhood's crime rate is above the median (`1`) or not (`0`).
Our goal is to build a binary logistic regression model on the training data that predicts whether a neighborhood is at risk of a high crime rate.
The key variables in the dataset are summarized below:



|Column|Description|
|--|--|
|`zn`|proportion of residential land zoned for large lots (over 25,000 square feet) (*predictor variable*)|
|`indus`|proportion of non-retail business acres per suburb (*predictor variable*)|
|`chas`|dummy variable for whether the suburb borders the Charles River (1) or not (0) (*predictor variable*)|
|`nox`|nitrogen oxides concentration (parts per 10 million) (*predictor variable*)|
|`rm`|average number of rooms per dwelling (*predictor variable*)|
|`age`|proportion of owner-occupied units built prior to 1940 (*predictor variable*)|
|`dis`|weighted mean of distances to five Boston employment centers (*predictor variable*)|
|`rad`|index of accessibility to radial highways (*predictor variable*)|
|**`target`**|**whether the crime rate is above the median crime rate (1) or not (0) (*response variable*)**|
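
For reference, the binary logistic regression model described above is usually written in the following standard form. The symbols are notation assumed here rather than taken from the assignment: $x_1, \dots, x_p$ are the predictors and $\beta_0, \dots, \beta_p$ are the coefficients to be estimated.

$$
P(\text{target} = 1 \mid x_1, \dots, x_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
$$

Equivalently, the log-odds are linear in the predictors:

$$
\log \frac{P(\text{target} = 1 \mid x_1, \dots, x_p)}{1 - P(\text{target} = 1 \mid x_1, \dots, x_p)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
$$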


In [12]:
# Load training and test data
train_df = pd.read_csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework-3/crime-training-data")
test_df = pd.read_csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework-3/crime-evaluation-data")


## Data Exploration:

In [14]:
# Check the shape of the training dataset
print("Shape of training dataset:", train_df.shape)


Shape of training dataset: (466, 13)


The training dataset has 466 observations of 13 variables: 12 predictors and one response variable, `target`.

In [15]:
# Display the structure of the training dataset
print(train_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 466 entries, 0 to 465
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   zn       466 non-null    float64 
 1   indus    466 non-null    float64 
 2   chas     466 non-null    category
 3   nox      466 non-null    float64 
 4   rm       466 non-null    float64 
 5   age      466 non-null    float64 
 6   dis      466 non-null    float64 
 7   rad      466 non-null    int64   
 8   tax      466 non-null    int64   
 9   ptratio  466 non-null    float64 
 10  lstat    466 non-null    float64 
 11  medv     466 non-null    float64 
 12  target   466 non-null    category
dtypes: category(2), float64(9), int64(2)
memory usage: 41.3 KB
None


All of the columns hold numeric values, but the predictor `chas` is a dummy variable, as is the response variable `target`, so we recode both as categories.

In [19]:
# Convert 'chas' and 'target' columns to categorical variables
train_df['chas'] = train_df['chas'].astype('category')
train_df['target'] = train_df['target'].astype('category')
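
As a small illustration of what this recoding changes, the sketch below applies the same conversion to a made-up frame (`toy` and its values are hypothetical, not the crime data). One practical consequence is that `describe()` with default arguments summarizes only the numeric columns, so categorical columns drop out of the summary table.

```python
import pandas as pd

# Hypothetical stand-in for train_df: two binary columns and one numeric column
toy = pd.DataFrame({
    "chas": [0, 1, 0, 0],
    "target": [1, 0, 1, 0],
    "nox": [0.5, 0.6, 0.4, 0.7],
})

# Recode the binary columns as categoricals in a single call
toy = toy.astype({"chas": "category", "target": "category"})

print(toy.dtypes)

# With default arguments, describe() keeps only the numeric columns
print(list(toy.describe().columns))
```

Passing `include="all"` to `describe()` is one way to bring the categorical columns back into the summary if that is wanted.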


Let's take a look at the summary statistics for the variables in the dataset.

In [20]:
# Summary statistics for the dataset
print(train_df.describe())

# Standard deviation
print(train_df.drop(columns=['chas', 'target']).std())


               zn       indus         nox          rm         age         dis  \
count  466.000000  466.000000  466.000000  466.000000  466.000000  466.000000   
mean    11.577253   11.105021    0.554311    6.290674   68.367597    3.795693   
std     23.364651    6.845855    0.116667    0.704851   28.321378    2.106950   
min      0.000000    0.460000    0.389000    3.863000    2.900000    1.129600   
25%      0.000000    5.145000    0.448000    5.887250   43.875000    2.101425   
50%      0.000000    9.690000    0.538000    6.210000   77.150000    3.190950   
75%     16.250000   18.100000    0.624000    6.629750   94.100000    5.214600   
max    100.000000   27.740000    0.871000    8.780000  100.000000   12.126500   

              rad         tax     ptratio       lstat        medv  
count  466.000000  466.000000  466.000000  466.000000  466.000000  
mean     9.530043  409.502146   18.398498   12.631459   22.589270  
std      8.685927  167.900089    2.196845    7.101891    9.239681 

The output reports the count, mean, standard deviation, and quartiles (the median is the `50%` row) for each numeric variable in the dataset.

For the `target` variable, there are 229 instances where crime level is above the median level (`target` = 1) and 237 instances where crime is not above the median level (`target` = 0). Since the response variable is fairly balanced between its two levels, the data will not require any resampling to weight the distributions for each level.
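
The class balance can be checked with `value_counts`. The sketch below rebuilds a series from the two counts quoted above rather than loading the data, so `target` here is a reconstructed stand-in for `train_df['target']`.

```python
import pandas as pd

# Stand-in for train_df['target'], rebuilt from the counts quoted in the text
target = pd.Series([1] * 229 + [0] * 237, name="target")

# Absolute counts and relative shares of each class
counts = target.value_counts().to_dict()
shares = target.value_counts(normalize=True).round(3).to_dict()

print(counts)
print(shares)
```

Both classes sit within about one percentage point of an even split, which is why no resampling is applied.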

The `tax` variable does not match our expectation that higher property tax rates would correspond to less crime.

The minimum, first quartile, and median of `zn` are all 0. Since `zn` is the proportion of residential land zoned for large lots, at least half of the neighborhoods in this dataset have no land zoned for large lots.
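
A median of 0 on a non-negative variable means at least half of the values are exactly 0. The sketch below shows how this could be checked directly; the `zn` values are a made-up sample for illustration, not the actual column.

```python
import pandas as pd

# Hypothetical zn-like sample: mostly zeros with a few large values
zn = pd.Series([0, 0, 0, 0, 0, 0, 12.5, 20, 25, 80], name="zn")

# Minimum, first quartile, and median
q = zn.quantile([0.0, 0.25, 0.5])

# Share of observations that are exactly zero
zero_share = (zn == 0).mean()

print(q)
print(f"Share of zeros: {zero_share:.0%}")
```

On the real data, the same two lines applied to `train_df['zn']` would give the exact share of neighborhoods with no large-lot zoning.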

For the `age` variable, the median (77.15) is higher than the mean (68.37). This suggests the distribution is left-skewed: most neighborhoods have a high proportion of homes built prior to 1940, with a tail of neighborhoods where the housing stock is newer.
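
One rough way to quantify this is Pearson's median skewness coefficient, 3 × (mean − median) / std, computed here from the summary statistics printed above. This is a heuristic chosen for illustration, not the moment-based skewness that a function such as `scipy.stats.skew` would report on the raw column.

```python
# Values copied from the describe() output for `age`
age_mean = 68.367597
age_median = 77.150000
age_std = 28.321378

# Pearson's median skewness: negative values point to a longer left tail
pearson_skew = 3 * (age_mean - age_median) / age_std

print(f"Pearson median skewness for age: {pearson_skew:.2f}")
```

The coefficient comes out at about -0.93, consistent with the left skew described above.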

There do not appear to be any missing values to address. Let's validate this.
