# Module 1: Descriptive Statistics in Python Exercise
Template used from the [module](https://github.com/digitalshawn/STC551/blob/main/Module%201/Descriptive_Stats_Exercise.ipynb) on Canvas.

Here we load the modules we will use in this script. They are the same modules that are used in the [example notebook](https://github.com/digitalshawn/STC551/blob/main/Module%201/Descriptive%20Stats%20Example.ipynb).

In [None]:
%matplotlib

In [33]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px # accessible module for plotting graphs
from scipy.stats import skew, kurtosis # to analyze the skew of our dataset
import plotly.figure_factory as ff

# Loading the GoFundMe Data

Below we load the GoFundMe data directly via its GitHub URL. Briefly take a look [at the data file](https://raw.githubusercontent.com/lmeninato/GoFundMe/master/data-raw/GFM_data.csv). You'll see that although the file name ends in .csv, the fields are delimited (separated) by a tab and not a comma. I've flagged this for pandas' read_csv() function using the `sep` argument, setting it equal to a tab (`\t`).



In [34]:
df = pd.read_csv("https://raw.githubusercontent.com/lmeninato/GoFundMe/master/data-raw/GFM_data.csv", sep="\t")
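To see what the `sep` argument changes, here is a minimal sketch that parses a small in-memory, tab-delimited sample. The two data rows are made up for illustration and are not taken from the GoFundMe file.

```python
import io
import pandas as pd

# Hypothetical two-row sample; fields are separated by tabs, as in GFM_data.csv
sample = "Category\tAmount_Raised\nMedical\t100.0\nTravel\t50.0\n"

with_tab = pd.read_csv(io.StringIO(sample), sep="\t")  # two separate columns
with_comma = pd.read_csv(io.StringIO(sample))          # default sep=",": one fused column

print(with_tab.columns.tolist())
print(with_comma.columns.tolist())
```

With the default comma separator, each line is read as a single field, which is why the tab has to be named explicitly.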


# Let's explore the data file






In [35]:
#1. First few rows of the data file 
df.head(5)

Unnamed: 0.1,Unnamed: 0,Url,Category,Position,Title,Location,Amount_Raised,Goal,Number_of_Donators,Length_of_Fundraising,FB_Shares,GFM_hearts,Text,Latitude,Longitude
0,0,https://www.gofundme.com/3ctqm-medical-bills-f...,Medical,0,92 Yr old Man Brutally Attacked.,"LOS ANGELES, CA",327345.0,15000,12167,1 month,26k,12k,Rodolfo Rodriguez needs your help today! 92 Yr...,34.052234,-118.243685
1,1,https://www.gofundme.com/olivia-stoy-bone-marr...,Medical,0,Olivia Stoy:Transplant & Liv it up!,"ASHLEY, IN",316261.0,1.0M,5598,3 months,12k,5.7k,Thomas Stoy needs your help today! Olivia Stoy...,41.527273,-85.065523
2,2,https://www.gofundme.com/autologous-Tcell-Tran...,Medical,1,AUTOLOGOUS T CELL TRANSPLANT,"STATEN ISLAND, NY",241125.0,250000,841,2 months,1.8k,836,Philip Defonte needs your help today! AUTOLOGO...,40.579532,-74.150201
3,3,https://www.gofundme.com/a-chance-of-rebirth,Medical,1,A chance of rebirth,"DUBLIN, CA",237424.0,225000,4708,1 month,9.7k,4.7k,Sriram Kanniah needs your help today! A chance...,37.702152,-121.935792
4,4,https://www.gofundme.com/teamclaire,Medical,1,Claire Wineland Needs Our Help,"GARDEN GROVE, CA",236590.0,225000,8393,2 months,6.4k,8.9k,Melissa Yeager needs your help today! Claire W...,33.774269,-117.937995


**2. List and write the description of column headers**
1. **Unnamed: 0.1** and **Unnamed: 0** = Index (result number).
2. **Url** = Link to the GoFundMe page.
3. **Category** = Type of fundraiser.
4. **Position** = Result position (on the webpage).
5. **Title** = Name of the fundraiser.
6. **Location** = The city and state where the post was created (crowdfunding location).
7. **Amount_Raised** = Amount of money the fundraiser raised.
8. **Goal** = The "target" amount to fundraise (to reach or pass).
9. **Number_of_Donators** = Number of people who donated money.
10. **Length_of_Fundraising** = How long the fundraiser and page took donations.
11. **FB_Shares** = Number of people who shared the webpage/fundraiser to Facebook.
12. **GFM_hearts** = Number of "hearts" per fundraising post.
13. **Text** = Description of the fundraiser (purpose).
14. **Latitude** = Exact latitude [of location].
15. **Longitude** = Exact longitude [of location].

# Campaigns by Category

In [36]:
#1. How many campaigns are in each category? 
campaigns = df['Category'].value_counts()
campaigns 

Medical        76
Memorial       72
Volunteer      72
Travel         72
Sports         72
Newlywed       72
Family         72
Faith          72
Event          72
Creative       72
Competition    72
Community      72
Business       72
Education      72
Charity        72
Emergency      72
Wishes         72
Animals        10
11525.0         1
-73.9495823     1
-75.3199035     1
Name: Category, dtype: int64
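The last three entries (`11525.0`, `-73.9495823`, `-75.3199035`) look like numeric values from rows whose fields shifted during parsing, not real categories. One possible way to set such rows aside, sketched here on a made-up Series, is to drop labels that parse as numbers. Treating every numeric-looking label as malformed is an assumption.

```python
import pandas as pd

# Made-up category labels, including two numeric-looking strays
categories = pd.Series(["Medical", "Travel", "11525.0", "Animals", "-73.9495823"])

# Assumption: a label that converts to a number is a malformed row, not a category
is_numeric = pd.to_numeric(categories, errors="coerce").notna()
clean_categories = categories[~is_numeric]

print(clean_categories.value_counts())
```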

In [37]:
#2. What is the average $ amount raised in each category? 
categories_average = {'Category': ['Medical', 'Memorial', 'Volunteer', 'Travel', 'Sports', 
                                   'Newlywed', 'Family', 'Faith', 'Event', 'Creative', 'Competition', 
                                   'Community', 'Business', 'Education', 'Charity', 'Emergency', 'Wishes', 'Animals'], 
                      'Amount Raised Average': []}

for each_category in categories_average['Category']: 
  in_category = df.loc[df['Category'] == each_category, 'Amount_Raised']
  # Missing amounts (NaN) are counted as 0 and still count toward the divisor
  total = in_category.fillna(0).sum()
  avg = '{:,.2f}'.format(total/len(in_category))
  categories_average['Amount Raised Average'].append(avg)


raised_avg = pd.DataFrame(categories_average) 
raised_avg

Unnamed: 0,Category,Amount Raised Average
0,Medical,147340.41
1,Memorial,115498.94
2,Volunteer,13642.47
3,Travel,6902.65
4,Sports,19540.12
5,Newlywed,3430.5
6,Family,63499.86
7,Faith,12545.35
8,Event,10825.94
9,Creative,25302.35
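The per-category average can also be written with `groupby`. The sketch below uses made-up numbers to show the one detail that matters here: by default pandas skips missing values, whereas the cell above counts them as 0.

```python
import numpy as np
import pandas as pd

# Made-up data; the real frame would be `df` with the same two columns
toy = pd.DataFrame({
    "Category": ["Medical", "Medical", "Travel", "Travel"],
    "Amount_Raised": [100.0, np.nan, 40.0, 60.0],
})

# Mean that ignores missing amounts (pandas' default behaviour)
mean_skip_nan = toy.groupby("Category")["Amount_Raised"].mean()

# Mean that counts missing amounts as 0, as the cell above does
mean_nan_as_zero = (
    toy.fillna({"Amount_Raised": 0}).groupby("Category")["Amount_Raised"].mean()
)

print(mean_skip_nan)
print(mean_nan_as_zero)
```

Which variant is appropriate depends on whether a missing amount should be read as "nothing raised" or as "unknown".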


In [38]:
def change_dollar(money): 
  #Returns the goal as a string of digits, e.g. '1.0M' -> '1000000' and '15,000' -> '15000'
  money = str(money).replace(',', '')
  if 'M' in money: 
    #Scale the number before the 'M' by one million (also handles '2M' or '1.25M')
    money = str(int(round(float(money.replace('M', '')) * 1000000)))
  return money

In [39]:
df['clean_goal'] = df['Goal'].apply(change_dollar)
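If the cleaned goals are needed as numbers and not strings, a variant along the following lines may be convenient. `parse_goal` is a hypothetical helper, not part of the original notebook, and it assumes that `M` is the only suffix that appears in the `Goal` column.

```python
def parse_goal(value):
    """Convert a goal such as '15000', '1,500' or '1.0M' to a float."""
    text = str(value).replace(",", "").strip()
    if text.endswith("M"):
        # Scale by one million instead of padding the string with zeros
        return float(text[:-1]) * 1_000_000
    return float(text)  # the string 'nan' becomes float('nan')

print(parse_goal("1.0M"), parse_goal("2M"), parse_goal("15000"))
```

Applied with `df['Goal'].apply(parse_goal)`, this would give a float column that pandas can average directly, with missing goals left as NaN.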

In [40]:
#3. What is the average fundraising goal in each category? 
avg_cat_goal = {'Category': ['Medical', 'Memorial', 'Volunteer', 'Travel', 'Sports', 
                                   'Newlywed', 'Family', 'Faith', 'Event', 'Creative', 'Competition', 
                                   'Community', 'Business', 'Education', 'Charity', 'Emergency', 'Wishes', 'Animals'], 
                      'Goal Average': []}

for each_category in avg_cat_goal['Category']: 
  goals = df.loc[df['Category'] == each_category, 'clean_goal']
  total = 0 
  for each_item in goals:
    if each_item != 'nan':   #missing goals are counted as 0
      total += int(each_item)
  avg = '{:,.2f}'.format(total/len(goals))
  avg_cat_goal['Goal Average'].append(avg)

avg_cgoal = pd.DataFrame(avg_cat_goal)
avg_cgoal

Unnamed: 0,Category,Goal Average
0,Medical,199735.76
1,Memorial,112638.89
2,Volunteer,46422.08
3,Travel,57241.96
4,Sports,26839.46
5,Newlywed,20813.94
6,Family,77055.56
7,Faith,54238.03
8,Event,14073.61
9,Creative,77225.14


***#4: Provide a text summary of the results***
In every category except Memorial, the average amount raised was lower than the average goal. The highest average amount raised was in the Medical category, followed by Community, Emergency, and Memorial. The highest average goal was in the Medical category, followed by Charity, Emergency, and Memorial. 
 
**Note:** Some of the Amount Raised values were "nan" and were therefore treated as 0, since the data was unavailable.

# Looking for outliers in shares and hearts


In [41]:
#Cleaning Data 
def clean_number(shares):
  #Returns a clean integer, e.g. '5.7k' -> 5700, '26k' -> 26000, NaN -> 0
  text = str(shares)
  if text == 'nan': 
    return 0 
  if 'k' in text: 
    #Scale the number before the 'k' by one thousand
    return int(round(float(text.replace('k', '')) * 1000))
  return int(float(text))

df['clean_fb_shares'] = df['FB_Shares'].apply(clean_number)
df['clean_gfm_hearts'] = df['GFM_hearts'].apply(clean_number)

For the following section, I explored these categories: Animals, Travel, and Medical. 
1. Select 3 categories and create a boxplot of the FB shares and GFM hearts
2. Plot the outliers in the boxplot
3. Calculate the mean, median, mode, std deviation, and variance for the 3 categories' FB shares and GFM hearts
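A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to the ten cleaned FB share counts that the Animals cell prints. It uses pandas' default quantile interpolation, which is an assumption: Plotly may compute quartiles slightly differently, so the fences in the figure can differ a little.

```python
import pandas as pd

# The ten cleaned FB share counts printed for the Animals category
shares = pd.Series([6, 70, 77, 152, 177, 645, 1300, 2600, 3100, 46000])

q1, q3 = shares.quantile(0.25), shares.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the candidates a boxplot draws as outliers
outliers = shares[(shares < lower_fence) | (shares > upper_fence)]
print(outliers.tolist())
```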

In [42]:
#Category: Animal FB SHARES
df_animals = df.loc[df['Category'] == 'Animals']
animals_fb = df_animals['clean_fb_shares']

#Graphing the Distributions FB SHARES
fig = px.box(animals_fb, x = 'clean_fb_shares', title = 'Category-Animal: Distribution by FB Shares')
fig.show()

#Category: Animal = Calculate the mean, mode, std deviation, and variance for the 3 categories' FB Shares 
print('Mean FB Shares:', animals_fb.mean()) 
print('Median FB Shares:', animals_fb.median())
print('Mode FB Shares:', animals_fb.mode())
print('STD Deviation FB Shares:', animals_fb.std())
print('Variance of FB Share:', animals_fb.var())

Mean FB Shares: 5412.7
Median FB Shares: 411.0
Mode FB Shares: 0        6
1       70
2       77
3      152
4      177
5      645
6     1300
7     2600
8     3100
9    46000
dtype: int64
STD Deviation FB Shares: 14304.498554650561
Variance of FB Share: 204618678.9
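The mode output lists all ten values because `Series.mode()` returns every value tied for the highest frequency, and here each value occurs exactly once. A minimal sketch with made-up numbers:

```python
import pandas as pd

unique_values = pd.Series([6, 70, 77])        # every value appears once
repeated_value = pd.Series([0, 0, 105, 300])  # 0 appears twice

print(unique_values.mode().tolist())   # all values tie, so all are returned
print(repeated_value.mode().tolist())  # a single most frequent value
```

When every value is unique, the mode carries no information about the distribution, so the median is the more useful summary for this category.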


In [43]:
#Category: Travel FB SHARES
df_travel = df.loc[df['Category'] == 'Travel']
travel_fb = df_travel['clean_fb_shares']

fig = px.box(travel_fb, x = 'clean_fb_shares', title = 'Category-Travel: Distribution by FB Shares')
fig.show() 

#Category: Travel = Calculate the mean, mode, std deviation, and variance for the 3 categories' FB Shares 
print('Mean FB Shares:', travel_fb.mean()) 
print('Median FB Shares:', travel_fb.median())
print('Mode FB Shares:', travel_fb.mode())
print('STD Deviation FB Shares:', travel_fb.std())
print('Variance of FB Share:', travel_fb.var())

Mean FB Shares: 201.45833333333334
Median FB Shares: 105.0
Mode FB Shares: 0    0
dtype: int64
STD Deviation FB Shares: 261.21152927879933
Variance of FB Share: 68231.46302816903


In [44]:
#Category: Medical FB SHARES
df_medical = df.loc[df['Category'] == 'Medical']
medical_fb = df_medical['clean_fb_shares']

fig = px.box(medical_fb, x = 'clean_fb_shares', title = 'Category-Medical: Distribution by FB Shares')
fig.show()

#Category: Medical = Calculate the mean, median, mode, std deviation, and variance of FB Shares 
print('Mean FB Shares:', medical_fb.mean()) 
print('Median FB Shares:', medical_fb.median())
print('Mode FB Shares:', medical_fb.mode())
print('STD Deviation FB Shares:', medical_fb.std())
print('Variance of FB Share:', medical_fb.var())


In [45]:
#Category: Animal GFM Hearts
df_animals = df.loc[df['Category'] == 'Animals']
animals_gfm = df_animals['clean_gfm_hearts']

fig = px.box(animals_gfm, x = 'clean_gfm_hearts', title = 'Category-Animal: Distribution by GFM Hearts')
fig.show()

#Category: Animal = Calculate the mean, median, mode, std deviation, and variance of GFM Hearts 
print('Mean GFM Hearts:', animals_gfm.mean()) 
print('Median GFM Hearts:', animals_gfm.median())
print('Mode GFM Hearts', animals_gfm.mode())
print('STD Deviation GFM Hearts:', animals_gfm.std())
print('Variance of GFM Hearts:', animals_gfm.var())

Mean GFM Hearts: 1367.3
Median GFM Hearts: 140.5
Mode GFM Hearts 0       30
1       44
2       52
3       54
4       63
5      218
6      513
7      599
8     1100
9    11000
dtype: int64
STD Deviation GFM Hearts: 3402.5615530394484
Variance of GFM Hearts: 11577425.122222224


In [46]:
#Category: Travel GFM Hearts
df_travel = df.loc[df['Category'] == 'Travel']
travel_gfm = df_travel['clean_gfm_hearts']

fig = px.box(travel_gfm, x = 'clean_gfm_hearts', title = 'Category-Travel: Distribution by GFM Hearts')
fig.show() 

#Category: Travel = Calculate the mean, mode, std deviation, and variance for the 3 categories' GFM Hearts 
print('Mean GFM Hearts:', travel_gfm.mean()) 
print('Median GFM Hearts:', travel_gfm.median())
print('Mode GFM Hearts', travel_gfm.mode())
print('STD Deviation GFM Hearts:', travel_gfm.std())
print('Variance of GFM Hearts:', travel_gfm.var())

Mean GFM Hearts: 72.25
Median GFM Hearts: 57.5
Mode GFM Hearts 0    47
1    65
dtype: int64
STD Deviation GFM Hearts: 61.063520379411145
Variance of GFM Hearts: 3728.7535211267605


In [47]:
#Category: Medical GFM Hearts
df_medical = df.loc[df['Category'] == 'Medical']
medical_gfm = df_medical['clean_gfm_hearts']


fig = px.box(medical_gfm, x = 'clean_gfm_hearts', title = 'Category-Medical: Distribution by GFM Hearts')
fig.show()

#Category: Medical = Calculate the mean, median, mode, std deviation, and variance of GFM Hearts 
print('Mean GFM Hearts:', medical_gfm.mean()) 
print('Median GFM Hearts:', medical_gfm.median())
print('Mode GFM Hearts', medical_gfm.mode())
print('STD Deviation GFM Hearts:', medical_gfm.std())
print('Variance of GFM Hearts:', medical_gfm.var())

Mean GFM Hearts: 1636.7894736842106
Median GFM Hearts: 1050.0
Mode GFM Hearts 0    1100
dtype: int64
STD Deviation GFM Hearts: 1822.3405266545876
Variance of GFM Hearts: 3320924.99508772


**4a. Summarize: FB Shares**

In the first category, Animals, the average number of FB shares was 5,413 per post. The minimum and maximum values were 6 and 46,000, and the maximum was the only outlier. The standard deviation was 14,304. In the second category, Travel, the average was 202 FB shares per post. The minimum and maximum values were 0 and 1,100. There were three outliers: 669, 910, and 1,100. The standard deviation was 261. In the last category, Medical, the minimum and maximum values were 147 and 26,000 (the outlier). A mean of 202 and a standard deviation of 261 would only repeat the Travel figures, so Medical's own mean and standard deviation have to be read from `medical_fb`.

**4b. Summarize: GFM Hearts**
In the first category, Animals, the average number of GoFundMe hearts was 1,367 "likes" per post. The minimum and maximum values were 30 and 11,000, with 11,000 as the only outlier. The standard deviation was 3,403. In the second category, Travel, the average was 72 "likes" per post. The minimum and maximum values were 1 and 351, and the outliers were 155, 165, 173, 190, 206, 251, and 351. The standard deviation was 61. In the last category, Medical, the average was 1,637 "likes" per post. The minimum and maximum values were 106 and 12,000, and the outliers were 4,700, 5,300, 5,700, 8,900, and 12,000. The standard deviation was 1,822 "likes."

**4c Conclusions:** 
Outliers pull the mean number of shares and hearts upward. For example, Travel had the most outliers across Facebook shares and GoFundMe hearts, so its means are probably somewhat higher than a typical Travel post would suggest. The Animals category had the largest standard deviation for both Facebook shares and GoFundMe hearts, which indicates a wider spread. With only 10 campaigns and such a large variance, its averages are less reliable. 
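To illustrate how strongly a single outlier moves the mean compared with the median, the sketch below recomputes both for the ten Animals FB share counts, with and without the largest value. Dropping the point is done only for comparison and is not a recommendation to discard it.

```python
import pandas as pd

# The ten cleaned FB share counts printed for the Animals category
shares = pd.Series([6, 70, 77, 152, 177, 645, 1300, 2600, 3100, 46000])
without_max = shares[shares < 46000]

print(shares.mean(), shares.median())            # with the outlier
print(without_max.mean(), without_max.median())  # without the outlier
```

The mean falls from about 5,413 to 903 once the largest value is removed, while the median only moves from 411 to 177.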

# Explore on your own
For the following section, I decided to look at campaigns under the Education category. 

In [48]:
#1. Category: Education 
education = df.loc[df['Category'] == 'Education'].copy()   #copy so new columns can be added without a SettingWithCopyWarning

In [49]:
def check_goal(raised, goal): 
  #True when the amount raised meets or exceeds the goal
  return bool(raised >= float(goal))

def return_margin(raised, goal): 
  #Margin = amount raised - goal (as a string); negative when the goal was missed
  return str(int(raised) - int(goal))

check_goal_res = [] 
margin = [] 
for each_index in education['Unnamed: 0']: 
  check_goal_res.append(check_goal(education.loc[each_index, 'Amount_Raised'], education.loc[each_index, 'clean_goal']))
  margin.append(return_margin(education.loc[each_index, 'Amount_Raised'], education.loc[each_index, 'clean_goal']))

The code above defines two functions. The first returns a Boolean indicating whether a fundraiser met its goal, given the amount raised and the goal. Once this column is added to the dataframe, its value counts show that Education fundraisers are more likely to miss their goal than to reach it. The second function returns the margin between the amount raised and the goal (amount raised minus goal). A negative value means the fundraiser did not reach its goal. The margin matters because it measures the degree of success, not just whether the goal was met. 

The code below will add the new findings as columns in the "education" dataframe. 

In [50]:
#Adding the new values to dataframe: The Likelihood of Reaching Goal
education['Met_Goal'] = check_goal_res
education['Met_Goal'].value_counts()
education['Margin'] = margin
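The same two columns can also be computed without an explicit loop. The following is a minimal sketch on made-up rows. It keeps the margin numeric, which differs from the notebook, where `Margin` is stored as a string.

```python
import pandas as pd

# Made-up rows standing in for the `education` frame
toy = pd.DataFrame({
    "Amount_Raised": [1200.0, 400.0],
    "clean_goal": ["1000", "900"],  # goals are strings after cleaning
})

goal = toy["clean_goal"].astype(float)
toy["Met_Goal"] = toy["Amount_Raised"] >= goal
toy["Margin"] = toy["Amount_Raised"] - goal  # negative means the goal was missed

print(toy)
```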

The next section of the code looks at the difference between the amount raised and the goal. The "Margin" column shows this difference. If the margin is negative, the fundraiser did not meet its goal. The for-loop accumulates the margins separately for fundraisers that missed their goal (`sum1`) and for those that met it (`sum2`). The cell after it divides each sum by the number of Education fundraisers to compute the averages. 

In [51]:
sum1 = 0 
sum2 = 0 
 
for each_index in education['Unnamed: 0']: 
  if education.loc[each_index, 'Met_Goal'] == False: 
    margin = education.loc[each_index, 'Margin'] 
    sum1 += int(margin)
  elif education.loc[each_index, 'Met_Goal'] == True: 
    margin = education.loc[each_index, 'Margin']
    sum2 += int(margin)
  else: 
    continue 

In [52]:
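#Both sums are divided by the total number of Education fundraisers, not by the size of each group
average1 = sum1 / len(education)
average2 = sum2 / len(education)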
str_avg1 = '{:.2f}'.format(average1)
str_avg2 = '{:.2f}'.format(average2)
print('Average Difference of Not Meeting Goal: ' + str(str_avg1))
print('\n')
print('Average Difference of Meeting Goal: ' + str(str_avg2))

Average Difference of Not Meeting Goal: -23872.07


Average Difference of Meeting Goal: 7324.08
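Because both sums are divided by `len(education)`, the two figures are averages over all Education fundraisers, not over each group on its own. The sketch below, on made-up margins, contrasts the two divisors. The per-group version is shown only as an alternative reading of "average difference".

```python
import pandas as pd

# Made-up margins; True/False marks whether the goal was met
toy = pd.DataFrame({
    "Met_Goal": [True, False, False, True],
    "Margin": [200, -500, -300, 100],
})

# Divisor used in the notebook: every row, whichever group it belongs to
over_all_rows = toy.groupby("Met_Goal")["Margin"].sum() / len(toy)

# Alternative: divide each group's sum by that group's own size
within_group = toy.groupby("Met_Goal")["Margin"].mean()

print(over_all_rows)
print(within_group)
```

In this toy example the missed-goal group averages -200 over all rows but -400 within the group, so the choice of divisor changes the result noticeably.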


The `value_counts` method shows how many fundraisers did and did not meet their goal. 

In [54]:
education['Met_Goal'].value_counts() 

False    46
True     26
Name: Met_Goal, dtype: int64
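These counts can be turned into proportions with `value_counts(normalize=True)`. The sketch rebuilds the 46 and 26 reported above as a stand-alone Series, without using the real `education` frame.

```python
import pandas as pd

# Rebuild the reported counts: 46 fundraisers missed their goal, 26 met it
met_goal = pd.Series([False] * 46 + [True] * 26)

proportions = met_goal.value_counts(normalize=True)
print(proportions.round(3))
```

About 64% of the Education fundraisers missed their goal and about 36% met it.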

In [53]:
edu_raised = 0 
edu_goal = 0 
time_margin = {'Duration': ['4 months', '5 months', '2 months', '1 month', '3 months', '6 months', 
                            '24 days', '23 days', '19 days', '13 days', '8 days'], '# of Donor(s)': [], 'Average Raised': [], 'Average Goal': [], 'Difference': []}

for each_time in time_margin['Duration']: 
  for each_mar in education.loc[education['Length_of_Fundraising'] == each_time]['Amount_Raised']: 
    edu_raised += each_mar
  avg = '{:.2f}'.format(edu_raised/len(education.loc[education['Length_of_Fundraising'] == each_time]))
  time_margin['Average Raised'].append(avg) 
  edu_raised = 0 

for each_time in time_margin['Duration']: 
  for each_mar in education.loc[education['Length_of_Fundraising'] == each_time]['clean_goal']: 
    edu_goal += int(each_mar) 
  avg = '{:.2f}'.format(edu_goal/len(education.loc[education['Length_of_Fundraising'] == each_time]))
  time_margin['Average Goal'].append(avg) 
  edu_goal = 0 

for i in range(len(time_margin['Duration'])): 
  time_margin['Difference'].append(float(time_margin['Average Raised'][i]) - float(time_margin['Average Goal'][i]))

avg_don = 0 
for each_time in time_margin['Duration']: 
  for each_mar in education.loc[education['Length_of_Fundraising'] == each_time]['Number_of_Donators']: 
    each_mar = each_mar.replace(',', '')
    avg_don += int(each_mar) 
  avg = int(avg_don/len(education.loc[education['Length_of_Fundraising'] == each_time]))
  time_margin['# of Donor(s)'].append(avg)
  avg_don = 0

time_margin_df = pd.DataFrame(time_margin)

The next line of code shows a dataframe (based on the duration) of all the values that were computed, including the donor count. This table is helpful to visually see and compare all of the data in one place. 

In [56]:
time_margin_df

Unnamed: 0,Duration,# of Donor(s),Average Raised,Average Goal,Difference
0,4 months,333,68458.93,82800.07,-14341.14
1,5 months,259,32645.4,64202.47,-31557.07
2,2 months,489,47071.58,62410.17,-15338.59
3,1 month,313,35609.33,49791.67,-14182.34
4,3 months,376,35240.0,40500.1,-5260.1
5,6 months,339,40797.33,60333.33,-19536.0
6,24 days,793,162958.0,150000.0,12958.0
7,23 days,293,53621.0,50000.0,3621.0
8,19 days,149,29041.0,75000.0,-45959.0
9,13 days,421,30000.0,25000.0,5000.0
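A table like this can also be built with a single `groupby` and named aggregations, which avoids hard-coding the list of durations. This is a sketch on made-up rows. The column names `goal` and `donors` are placeholders for numeric versions of `clean_goal` and `Number_of_Donators`.

```python
import pandas as pd

# Made-up rows; `goal` and `donors` are assumed to be numeric already
toy = pd.DataFrame({
    "Length_of_Fundraising": ["1 month", "1 month", "24 days"],
    "Amount_Raised": [100.0, 300.0, 500.0],
    "goal": [150.0, 250.0, 400.0],
    "donors": [10, 30, 50],
})

summary = toy.groupby("Length_of_Fundraising").agg(
    donors=("donors", "mean"),
    average_raised=("Amount_Raised", "mean"),
    average_goal=("goal", "mean"),
)
summary["difference"] = summary["average_raised"] - summary["average_goal"]

print(summary)
```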


In [57]:
fig = px.bar(education, x = 'Length_of_Fundraising', y = 'Amount_Raised', hover_data = ['Met_Goal', 'Amount_Raised', 'Goal', 'Margin'], color = 'Met_Goal')
fig.show() 

**4. Provide a one to two paragraph summary of the success of this category.** 

The first thing I looked into was the amount raised and the goal of each fundraiser under Education. I created a function that determines whether a fundraiser met its goal. In this category, 64% of fundraisers (46 of 72) did not reach their goal. Their combined shortfall, averaged over all 72 Education fundraisers, was $23,872. Fundraisers that met their goal exceeded it by a combined amount that averages $7,324 over the same 72 fundraisers. The 24-day fundraising length had the largest average amount raised of all the Education fundraisers. The 24-day, 23-day, and 13-day lengths were the only ones whose average amount raised exceeded the average goal. Future fundraisers with these time frames may be more likely to reach their goal if the goal is at or near that length's average goal. The graph shows the amount raised by time length, colored by whether the goal was met. The 3-month timeframe seems to have roughly a fifty percent chance of being successful. The 6-month and 8-day timeframes did not have any fundraisers that met their goal. The most popular timeframes are 4 months and 5 months. The 4-month timeframe raised the most money overall, regardless of whether its fundraisers reached their goal. 