### Introduction
__Business problem__

## Project Goals
- **Examine** historical box office performance across various genres, budgets, revenues and release dates.  
- **Identify** key trends that contribute to a movie’s commercial success.  
- **Recommend** data-driven strategies to guide the creation and marketing of new films.

### Data Understanding
The data for this analysis comes from tn.movie_budgets.csv.

We will:
- Import the relevant libraries
- Load the data into a dataframe
- Explore and extract the data needed for the analysis
- Visualize the data and interpret the results
- Provide recommendations


__Import libraries__

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve
from sklearn.preprocessing import LabelEncoder, StandardScaler

__Load Data__

In [4]:
data = pd.read_csv('C:/Projects/Project_phase3/bigml_59c28831336c6604c800002a.csv', index_col=0)
data.head()

state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
OK,75,415,330-6626,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
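Because the file is read with `index_col=0`, the first column (`state`) becomes the row index rather than a regular column. The sketch below illustrates this on a small in-memory CSV; the three rows and the reduced column set are a hypothetical stand-in for the real file, not the notebook's data.

```python
import io

import pandas as pd

# Hypothetical three-row CSV that mimics the first columns of the churn file.
csv_text = """state,account length,area code,churn
KS,128,415,False
OH,107,415,False
NJ,137,415,False
"""

# index_col=0 promotes the first CSV column to the DataFrame index.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.index.name)           # name of the index, taken from the CSV header
print(toy.shape)                # the index is not counted as a column
print(toy.reset_index().shape)  # reset_index() turns it back into a column
```

If `state` is needed later as a feature, `reset_index()` is one way to recover it as an ordinary column.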


__Data Understanding__

In [5]:
data.shape

(3333, 20)

In [6]:
print(data.info())
print(data.describe())
print(data['churn'].value_counts(normalize=True))

<class 'pandas.core.frame.DataFrame'>
Index: 3333 entries, KS to TN
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   account length          3333 non-null   int64  
 1   area code               3333 non-null   int64  
 2   phone number            3333 non-null   object 
 3   international plan      3333 non-null   object 
 4   voice mail plan         3333 non-null   object 
 5   number vmail messages   3333 non-null   int64  
 6   total day minutes       3333 non-null   float64
 7   total day calls         3333 non-null   int64  
 8   total day charge        3333 non-null   float64
 9   total eve minutes       3333 non-null   float64
 10  total eve calls         3333 non-null   int64  
 11  total eve charge        3333 non-null   float64
 12  total night minutes     3333 non-null   float64
 13  total night calls       3333 non-null   int64  
 14  total night charge      3333 non-null   float64
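The `value_counts(normalize=True)` call in the cell above reports the share of each class in the target. That share matters when reading accuracy later: a model that always predicts the majority class already scores the majority share. The sketch below shows the idea on a hypothetical ten-row target with 0/1 labels; the numbers are made up for illustration and are not the notebook's churn rate.

```python
import pandas as pd

# Hypothetical target: 8 customers who stayed (0) and 2 who churned (1).
churn = pd.Series([0] * 8 + [1] * 2, name="churn")

counts = churn.value_counts()                # absolute count per class
shares = churn.value_counts(normalize=True)  # fraction per class, sums to 1

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = shares.max()

print(counts)
print(shares)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```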

## Data Cleaning

In [10]:
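# Drop columns that are not used as features in this version of the data
data_clean = data.drop(columns=['phone number', 'area code'])

# Encode the yes/no plan columns as integers (LabelEncoder sorts classes: no -> 0, yes -> 1)
le = LabelEncoder()
data_clean['international plan'] = le.fit_transform(data_clean['international plan'])
data_clean['voice mail plan'] = le.fit_transform(data_clean['voice mail plan'])

# Separate the features from the target
X = data_clean.drop(columns=['churn'])
y = data_clean['churn']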
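This section encodes the yes/no plan columns twice (with `LabelEncoder` above and with a dictionary `replace` in the next cell) and one-hot encodes `area code`. The sketch below, on a hypothetical four-row frame, suggests that the two yes/no encodings should produce the same codes and shows what `drop_first=True` does to the dummy columns. It uses `map` instead of `replace` for the dictionary step; that is a substitution for illustration, not the notebook's call.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame mirroring two of the notebook's columns.
toy = pd.DataFrame({
    "international plan": ["no", "yes", "no", "yes"],
    "area code": [415, 408, 510, 415],
})

# LabelEncoder sorts the classes, so "no" gets 0 and "yes" gets 1.
le = LabelEncoder()
label_codes = le.fit_transform(toy["international plan"])
print(list(le.classes_))

# An explicit dictionary makes the same mapping visible in the code.
mapped_codes = toy["international plan"].map({"no": 0, "yes": 1})

# One-hot encode area code; drop_first=True omits the first (lowest) category,
# which then acts as the reference level.
dummies = pd.get_dummies(toy, columns=["area code"], drop_first=True)
print(list(dummies.columns))
```

Dropping one dummy column avoids a perfectly redundant set of indicators, which can matter for linear models such as logistic regression.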

In [12]:
# Map the yes/no plan columns to 1/0 and cast the boolean target to int
data['international plan'] = data['international plan'].replace({'yes': 1, 'no': 0})
data['voice mail plan'] = data['voice mail plan'].replace({'yes': 1, 'no': 0})
data['churn'] = data['churn'].astype(int)

# One-hot encoding
data = pd.get_dummies(data, columns=['area code'], drop_first=True)

# Drop irrelevant column
data.drop('phone number', axis=1, inplace=True)

# Verify
print(data.head())
print("\nChurn distribution:\n", data['churn'].value_counts())
print("\nMissing values:\n", data.isnull().sum())

       account length  international plan  voice mail plan  \
state                                                        
KS                128                   0                1   
OH                107                   0                1   
NJ                137                   0                0   
OH                 84                   1                0   
OK                 75                   1                0   

       number vmail messages  total day minutes  total day calls  \
state                                                              
KS                        25              265.1              110   
OH                        26              161.6              123   
NJ                         0              243.4              114   
OH                         0              299.4               71   
OK                         0              166.7              113   

       total day charge  total eve minutes  total eve calls  total eve charge  \
state  