A: QUESTION:
What factors influence churn rate?
C1: PLAN TO ASSESS QUALITY OF DATA:
To find missing values: I imported NumPy and pandas and used the isna() and isnull() functions to see whether the data had any missing values. To find duplicate values: I used the duplicated() function to see whether any rows were duplicated. To detect outliers: I plotted histograms for all the variables to see which ones had outliers, then I plotted the ones with outliers as boxplots and imputed the outliers with the median so that each distribution was closer to normal.
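The missing-value and duplicate checks in this plan can be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not the actual churn data; the toy values and the helper name quality_report are assumptions for the example, and the plotting step is left out.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values and duplicate rows (hypothetical helper)."""
    missing = df.isna().sum()  # isnull() is an alias of isna()
    return {
        # Keep only the columns that actually contain missing values.
        "missing_per_column": missing[missing > 0].to_dict(),
        # duplicated() flags every repeat of an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data standing in for the real CSV (values are made up).
# The last row deliberately repeats the first one.
toy = pd.DataFrame(
    {
        "Age": [34.0, np.nan, 51.0, 34.0],
        "Income": [42000.0, 58000.0, np.nan, 42000.0],
        "Churn": ["No", "Yes", "No", "No"],
    }
)

report = quality_report(toy)
print(report)
```

On the real data, the same two calls would be run on the DataFrame returned by pd.read_csv().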
C2: JUSTIFICATION OF APPROACH:
I used isna() and isnull() to find missing values because they are the easiest and most common way to find null values, and they run quickly. I used the duplicated() function to find duplicated data because it is the method I learned for finding duplicates in Python. I used histograms to find outliers because they are the most visual way to see them: outliers show up as isolated bars that sit far from the bulk of the distribution.
C3: JUSTIFICATION OF TOOLS:
I used Python to clean my data because I would like to grow my Python skills, and because it is used more widely than R in the data science world: it has many more packages and more ways to integrate with a company's other tools.
I used the following tools in Python: pandas, math, numpy, matplotlib.pyplot, scipy.stats, and sklearn.decomposition. I used pandas to read the CSV and to inspect the DataFrame information. I used numpy alongside pandas when finding missing values with isna() and duplicate rows with duplicated(), both of which are pandas methods. I used matplotlib.pyplot to plot graphs, including the histograms and boxplots. I used scipy.stats to check the normality of each variable's distribution. I used sklearn.decomposition to run a PCA.
D1: CLEANING FINDINGS:
Duplicates: I did not find any duplicates in the data.
Missing data: Eight columns had missing values:
- Children had 2495 missing values
- Age had 2475 missing values
- Income had 2490 missing values
- Techie had 2477 missing values
- Phone had 1026 missing values
- TechSupport had 991 missing values
- Tenure had 931 missing values
- Bandwidth_GB_Year had 1021 missing values
Outliers: The columns that have outliers are: Income, Outage_sec_perweek, Email, Contacts, Yearly_equip_failure, and MonthlyCharge.
- Income contained 709 outliers, ranging from $65,000 to $300,000
- Outage_sec_perweek contained 503 outliers: values of 2.5 seconds and below, and values from 17 to 60 seconds
- Email contained 15 outliers: values of 4 and below, and values of 20 and above
- Contacts contained 8 outliers, ranging from 5 to 7
- Yearly_equip_failure contained 94 outliers, ranging from 2 to 6
- MonthlyCharge contained 3 outliers, with values of 300 and above
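The report does not state which rule was used to count the outliers. One common choice that matches what a boxplot draws is the 1.5 x IQR rule, sketched below on made-up numbers. The function name iqr_outlier_mask and the sample values are assumptions for illustration only.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside the boxplot whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (s < lower) | (s > upper)


# Made-up monthly charges; 310.0 is deliberately extreme.
charges = pd.Series(
    [150.0, 160.0, 170.0, 180.0, 190.0, 310.0], name="MonthlyCharge"
)

mask = iqr_outlier_mask(charges)
print("outlier count:", int(mask.sum()))
print("outlier range:", charges[mask].min(), "to", charges[mask].max())
```

Applying the same mask to each numeric column gives a count and a range of the kind listed above.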
D2: JUSTIFICATION OF MITIGATION METHODS:
I did not find duplicates in the data, but if I had, I would have dropped the duplicated rows.
For the missing values, I used the mean when the data was normally distributed, the median when the data was skewed or bimodal, and the mode when the data was categorical.
- 'Children': I filled the NAs with the median because it is positively skewed
- 'Age': I filled the NAs with the mean because it has a normal distribution
- 'Income': I filled the NAs with the mean because it has a normal distribution
- 'Techie': I filled the NAs with the mode because it is categorical data
- 'Phone': I filled the NAs with the mode because it is categorical data
- 'TechSupport': I filled the NAs with the mode because it is categorical data
- 'Tenure': I filled the NAs with the mode because it is categorical data
- 'Bandwidth_GB_Year': I filled the NAs with the mode because it is categorical data
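The imputation rule above (mean for roughly normal columns, median for skewed ones, mode for categorical ones) might look like this in pandas. The toy values are made up, and only three of the eight columns are shown.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
df = pd.DataFrame(
    {
        "Age": [30.0, 40.0, np.nan, 50.0],
        "Children": [0.0, 1.0, 1.0, np.nan],
        "Techie": ["No", np.nan, "No", "Yes"],
    }
)

# Mean for a roughly normal numeric column.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Median for a skewed numeric column.
df["Children"] = df["Children"].fillna(df["Children"].median())

# Mode for a categorical column; mode() returns a Series,
# so take its first entry.
df["Techie"] = df["Techie"].fillna(df["Techie"].mode()[0])

print(df)
print("remaining NAs:", int(df.isna().sum().sum()))
```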
For outliers, I first used boxplots to plot all the quantitative variables and see whether they had outliers. For the ones that did, I decided to impute the outliers using the median. The variables that I treated were: Income, Outage_sec_perweek, Email, Contacts, Yearly_equip_failure, and MonthlyCharge.
I first plotted them using boxplots, then ran a query to count the outliers, then turned the outliers into NAs, and finally replaced the NAs with the median so that the former outliers now hold typical values. I used the median for each variable because outliers can strongly influence the mean, whereas the median is less sensitive to them.
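The three steps in this paragraph (flag the outliers, set them to NA, fill with the median) could be written roughly as below. The 1.5 x IQR cutoff is an assumption, since the report does not say which rule the query used, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up values; 9.0 is deliberately extreme.
contacts = pd.Series([0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 9.0], name="Contacts")

# Step 1: flag values outside the boxplot whiskers (assumed 1.5 * IQR rule).
q1, q3 = contacts.quantile(0.25), contacts.quantile(0.75)
iqr = q3 - q1
is_outlier = (contacts < q1 - 1.5 * iqr) | (contacts > q3 + 1.5 * iqr)

# Step 2: turn the flagged values into NAs.
cleaned = contacts.mask(is_outlier, np.nan)

# Step 3: fill the NAs with the median of the remaining values.
cleaned = cleaned.fillna(cleaned.median())

print(cleaned.tolist())
```

Here the median is computed after the outliers are masked, so the extreme values do not enter it at all; computing it before masking would also be a defensible choice.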
D3: SUMMARY OF THE OUTCOMES:
I checked for duplicates and found none, so the dataset has no redundant entries. For missing data, I used different methods based on the data distribution: the mean for normally distributed data, the median for skewed data, and the mode for categorical data. I made these choices after looking at histograms to check each variable's distribution. Now, there are no missing values. I handled outliers by identifying them with boxplots and then replacing them with the median to keep the data balanced without extreme values. After cleaning, the treated variables no longer contain missing values or the outliers listed above.
Here are the results of cleaning my outliers and the cleaned-up variables:
D6: LIMITATIONS:
The methods I used to clean the data have some downsides. Dropping duplicates can make us lose important data and unbalance the dataset. Filling missing values with the mean for numbers can reduce variability and introduce bias, while using the mode for categories oversimplifies and can distort relationships. For outliers, using boxplots to find them, turning them into NAs, and then using the median can be subjective, change original values, and reduce variability. While these methods help prepare data for analysis, they can affect the accuracy and reliability of the results, leading to less accurate insights.
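To make the point about reduced variability concrete, here is a small made-up example: filling gaps with the mean leaves the mean unchanged but shrinks the standard deviation, because every filled value sits exactly at the center. The income figures are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up incomes with two missing entries.
income = pd.Series([20000.0, 40000.0, np.nan, 60000.0, np.nan, 80000.0])

# Mean imputation: every gap receives the same central value.
filled = income.fillna(income.mean())

print("mean before:", income.mean(), "after:", filled.mean())
print("std before: ", income.std())  # NaNs are skipped by default
print("std after:  ", filled.std())
```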
D7: IMPACT OF LIMITATIONS:
Even after I cleaned the data, there could still be problems affecting my research. Some data might still be missing or have hidden errors, and I might have removed some useful information by mistake. If I made many changes during cleaning and didn't document them well, it might be tough to understand the data fully. These problems can limit my analysis, make me less confident in my findings, and require more time and effort to address, which can complicate answering my research question.
E1: PRINCIPAL COMPONENTS:
In my PCA I used these variables: 'Income', 'Outage_sec_perweek', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year'. I used these variables because they are continuous numeric variables, which is what PCA requires.
E2: CRITERIA USED:
I believe PC1, PC2, and PC3 should be retained because they are the only components with an eigenvalue of 1 or more, which is the cutoff for retention under the Kaiser Rule; the other PCs have eigenvalues below 1.
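As a rough illustration of the Kaiser Rule, the sketch below computes eigenvalues from the correlation matrix of some data and keeps the components whose eigenvalue is at least 1. It uses NumPy's eigvalsh instead of sklearn.decomposition.PCA to keep the example small, and the data is randomly generated, so the number of components retained here says nothing about the churn dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 5 columns, with the first three columns
# sharing a common signal so that they are correlated.
base = rng.normal(size=(200, 1))
X = rng.normal(size=(200, 5))
X[:, :3] += 2.0 * base

# PCA on standardized variables amounts to an eigen-decomposition
# of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # largest first

# Kaiser Rule: retain components whose eigenvalue is 1 or more.
retained = [f"PC{i + 1}" for i, ev in enumerate(eigenvalues) if ev >= 1]

print(np.round(eigenvalues, 3))
print("retained:", retained)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue of 1 means a component explains as much variance as one original variable, which is the intuition behind the cutoff.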
E3: BENEFITS:
I believe the benefits of PCA are that it helps me simplify data, make better visualizations, reduce noise, improve model performance, and compress data. By focusing on the most important components, I can get clearer insights, make smarter decisions, and use resources more effectively. In my PCA example, focusing on components related to income, bandwidth usage, and monthly charges allows for more targeted and effective analysis, leading to better business results.