# Final Project

**Group Name: Winter is Coming** <br>
**Group Slogan: Make you frozen**<br>
**Group Members: Yilin Xia, Wentao Cheng, Xiner Liu, Yuqi Kang**

## Basic Information

* Dataset name: Angelist Startups in Africa
* Data source: Data.world
* URL: https://data.world/omayeli/angelist-startups-in-africa
* License: **Public Domain**. The work has been dedicated to the public domain by waiving all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
* General information: 41 files representing the data collected from 41 countries, 5.48 MB


## Operation in Advance
* Download the zip file, unzip it, and upload the contents to the "Africa" folder in the root directory
* Combine the per-country files into a single file and add a new column, "Country"
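
The combining step above could look roughly like the sketch below. It assumes that every file in the folder is a CSV named after its country; the helper name `combine_country_files` and the paths are illustrative and not taken from the project.

```python
from pathlib import Path

import pandas as pd


def combine_country_files(folder):
    """Read every CSV in `folder` and stack them, tagging each row with its country."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        # Assumption: the file name without its extension is the country name.
        frame["Country"] = path.stem
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Illustrative usage; writing with the default index is what later shows up
# as an "Unnamed: 0" column when the file is read back.
# combine_country_files("Africa").to_csv("Africa_Data.csv")
```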


## Extra Data
* Twitter data: 6917
* Sorted by date to show the latest news about AngelList startups in Africa
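
A minimal sketch of the date sort, assuming the tweets sit in a DataFrame with a date column. The column name `date` and the helper `latest_first` are hypothetical; the real file may use a different name.

```python
import pandas as pd


def latest_first(tweets, date_col="date"):
    """Return the tweets sorted newest first by `date_col` (a hypothetical column name)."""
    out = tweets.copy()
    # Unparseable dates become NaT and are pushed to the end.
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    out = out.sort_values(date_col, ascending=False, na_position="last")
    return out.reset_index(drop=True)
```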

In [1]:
import pandas as pd
import glob2
data=pd.read_csv("Africa_Data.csv")
del data["Unnamed: 0"]

In [2]:
data.columns

Index(['id', 'community_profile', 'name', 'angellist_url', 'logo_url',
       'thumb_url', 'quality', 'product_desc', 'high_concept',
       'follower_count', 'company_url', 'created_at', 'updated_at',
       'crunchbase_url', 'twitter_url', 'blog_url', 'facebook_url',
       'linkedin_url', 'video_url', 'markets', 'locations', 'company_size',
       'company_type', 'status', 'screenshots', 'Country'],
      dtype='object')

## Data Exploration
**Dataset Columns and Rows**<br />
4197 rows and 26 columns

**Dataset Types** <br />id is int64 and community_profile is boolean; all other columns are object.

**Missing Data** <br />The percentage of null values per attribute shows that id, community_profile, name, angellist_url, logo_url, thumb_url, quality, follower_count, created_at, updated_at, markets, locations and Country are complete or nearly complete. The remaining attributes are missing values, from about 3% (high_concept) up to 93% (status).<br/><pre>
id                    0.000000
community_profile     0.000000
name                  0.047653
angellist_url         0.000000
logo_url              0.000000
thumb_url             0.000000
quality               0.000000
product_desc          6.290207
high_concept          2.644746
follower_count        0.000000
company_url           4.693829
created_at            0.000000
updated_at            0.000000
crunchbase_url       79.580653
twitter_url          72.218251
blog_url             89.563974
facebook_url         73.743150
linkedin_url         85.704074
video_url            85.632595
markets               0.000000
locations             0.071480
company_size         50.083393
company_type         67.953300
status               93.376221
screenshots          70.026209
Country               0.000000</pre>
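
Percentages like the ones above can be computed with a one-liner. This is a sketch of one common way to do it, not necessarily the call the notebook used; `missing_percentage` is an illustrative name.

```python
import pandas as pd


def missing_percentage(frame):
    """Percentage of null values in each column of `frame`."""
    # isnull() gives a boolean frame; the mean of booleans is the null fraction.
    return frame.isnull().mean() * 100
```

Calling `missing_percentage(data)` on the loaded DataFrame returns a Series indexed by column name.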

**Explain the attributes** <br/>
* **id** Identifier of the company in the database; it is not unique per record in this dataset
* **community_profile** Whether the company has a community profile. According to the groupby result, more companies lack a community profile than have one
* **name** Company Name
* **angellist_url** Company website on Angellist
* **logo_url** Company logo
* **thumb_url** Company logo in square shape
* **quality** Quality score ranging from 0 to 10; the raw data also contains two unusual values (False and True)
* **product_desc** Description of the products the company makes
* **high_concept** The high concept the company needs or creates
* **follower_count** Number of followers
* **markets** Related markets like country or industry
* **locations** detailed locations including cities
* **Country** Which country the company belongs to
* <br/>
* **company_size** Number of people in the company; some unusual categories appear
* **company_type** Type of company, such as technology or startup
* **status** Deleted (no useful information)
* **screenshots** Screenshots of the company's official website
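
Some of the observations in this list can be reproduced with a few quick checks. The sketch below assumes the column names shown in `data.columns`; the helper `describe_attributes` is illustrative and not part of the notebook.

```python
import pandas as pd


def describe_attributes(frame):
    """Collect a few quick checks on the columns described above."""
    return {
        # True only if no company id is repeated across records.
        "id_is_unique": frame["id"].is_unique,
        # Number of records with and without a community profile.
        "community_profile_counts": frame.groupby("community_profile")["id"].count(),
        # Distinct raw values, which exposes strays such as "True" and "False".
        "quality_values": sorted(frame["quality"].astype(str).unique()),
    }
```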

## Expected Result

According to the goal we want to achieve, we filter the columns to get what we need.
* **African Interactive Map**: Country, id
* **WordCloud**: product_desc, high_concept, markets
* **Bar Chart**: quality, follower_count, twitter_url
* **Detailed Information**: id, community_profile, name, angellist_url, thumb_url, company_url

<img src="Default/Exp_1.png">
<img src="Default/Exp_2.png">
<img src="Default/Exp_3.png">

## Data Cleaning
* Replace the country names based on "world_map_codes.csv"
* Delete the rows whose quality or follower_count value is False or True
* Convert the values in the quality and follower_count columns to numeric values
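
The three cleaning steps can also be written without explicit row loops. The sketch below is a vectorised variant under the assumption that the stray values are stored either as booleans or as the strings "True" and "False"; `COUNTRY_FIXES` and `clean` are illustrative names.

```python
import pandas as pd

# The three country names that differ from the map file, as listed in the notebook.
COUNTRY_FIXES = {
    "Central_african_republic": "Central African Republic",
    "South_africa": "South Africa",
    "Sierra_leone": "Sierra Leone",
}


def clean(frame):
    """Return a cleaned copy: fixed country names, stray rows dropped, numeric columns."""
    bad = pd.Series(False, index=frame.index)
    for col in ("quality", "follower_count"):
        # Compare as strings so booleans and the strings "True"/"False" are both caught.
        bad |= frame[col].astype(str).isin(["True", "False"])
    out = frame.loc[~bad].copy()
    out["Country"] = out["Country"].replace(COUNTRY_FIXES)
    for col in ("quality", "follower_count"):
        out[col] = pd.to_numeric(out[col])
    return out
```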

In [3]:
word_map_code=pd.read_csv("Default/world_map_codes.csv")
#Rename the three countries whose names differ from those in world_map_codes.csv
data["Country"]=data["Country"].replace({
    "Central_african_republic":"Central African Republic",
    "South_africa":"South Africa",
    "Sierra_leone":"Sierra Leone",
})
#Check used to find the mismatched names:
#for data_coun in data["Country"].unique():
    #if data_coun not in word_map_code["Name"].unique():
        #print(data_coun)

country_cont=data.groupby('Country').count()['id'].reset_index()
country_cont.columns=["Country","Count"]

#Add the African countries that have no startups in the data, with a count of 0
country_codes=pd.read_csv("Default/country_codes.csv")
for name in country_codes[country_codes["Continent"]=="Africa"]["Name"].unique():
    if name not in country_cont["Country"].unique():
        new_row=pd.DataFrame([{'Country': name, 'Count': 0}])
        country_cont=pd.concat([new_row,country_cont], ignore_index=True,sort=False)
#Create a dict that maps each country name (column 0) to its map id (column 3)
name_dic=dict()
for row in range(len(word_map_code)):
    name_dic[word_map_code.iloc[row,0]]=word_map_code.iloc[row,3]
    
#name_dic[country_cont.iloc[0,1]]



In [4]:
#delete the rows whose quality or follower_count value is "True" or "False"
data=data[data['quality']!="False"]
data=data[data['quality']!="True"]
data=data[data['follower_count']!="False"]
data=data[data['follower_count']!="True"]
# data["quality"].unique()
data["quality"] = pd.to_numeric(data["quality"])
data["follower_count"] = pd.to_numeric(data["follower_count"])

In [5]:
tw=pd.read_csv("Default/twitter_data.csv")
Detailed_info = data[['Country', 'name', 'angellist_url', 'company_url']]

## Project Design

In [6]:
import bqplot
from ipywidgets import Button, GridBox, Layout, ButtonStyle
import ipywidgets as widgets
import pandas as pd
import nltk
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

In [7]:
#Look up the map id of every country by name (by column label, so the column order does not matter),
#then build the {id: count} dict that colours the map
co_id=[name_dic[country] for country in country_cont["Country"]]
country_cont["id"]=co_id
country_cont=country_cont[["id","Count"]]
dic=dict(zip(country_cont["id"],country_cont["Count"]))

In [8]:
import bqplot
col_sc=bqplot.ColorScale(scheme="Blues")
c_ax=bqplot.ColorAxis(scale=col_sc,orientation="vertical",side="right",tick_style={"font-size":7})
def_tt = bqplot.Tooltip(fields=["name"])
map_mark = bqplot.Map(scales={'projection': bqplot.Mercator(center=(6.61,20.93),scale_factor=500),'color':col_sc}, 
                      color=dic,
                      tooltip=def_tt, 
                      interactions = { "click":"select",'hover': 'tooltip'},
                      selected_styles={"selected_fill": "Red"},
                      hovered_styles={"hovered_fill": "Orange"})
fig=bqplot.Figure(marks=[map_mark], axes=[c_ax], title='Angellist Startups in Africa')
africa_map = mpimg.imread('Default/Africa_Map.jpg')
#Read the placeholder image once and reuse its bytes for the three word cloud widgets
with open("Default/Africa_Map.jpg", "rb") as f:
    default_image = f.read()
image_pro = widgets.Image(value=default_image,width=250,height=75)
image_high = widgets.Image(value=default_image,width=250,height=75)
image_mark = widgets.Image(value=default_image,width=250,height=75)
#===============================================

ta1=widgets.Textarea(layout=widgets.Layout(width='95%', height='100px'),disabled=False)
ta2=widgets.Textarea(layout=widgets.Layout(width='95%', height='100px'))
ta3=widgets.Textarea(layout=widgets.Layout(width='95%', height='100px'))
ta4=widgets.Textarea(layout=widgets.Layout(width='95%', height='100px'))
ip=widgets.IntProgress()
button_plus=widgets.Button(description="Next Page>>>",
                           style=widgets.ButtonStyle(button_color='lightblue'))
button_minus=widgets.Button(description="<<<Previous Page",
                           style=widgets.ButtonStyle(button_color='moccasin'))

#===============================================
country0=widgets.Text()
name0=widgets.Text()
angel0=widgets.Text()
company0=widgets.Text()

country=widgets.Text()
name=widgets.Text()
angel=widgets.Text()
company=widgets.Text()
tab1 = widgets.Tab(children=[country,name,angel,company])
tab1.set_title(0,'Country')
tab1.set_title(1,'Company_name')
tab1.set_title(2,'Angellist_url')
tab1.set_title(3,'Company_url')

x=widgets.Select(
    options=[],
    #value='',
    #rows=10,
    description='Company:',
    disabled=False
)

#===============================================
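sel_tw=[]  #tweets of the current selection; refilled by selection_changed

def show_tweets(start):
    #Fill the four text areas from sel_tw starting at `start`; leave blanks past the end
    for offset, ta in enumerate([ta1,ta2,ta3,ta4]):
        idx=start+offset
        ta.value=str(sel_tw[idx]) if idx<len(sel_tw) else ""

def click_down(event):
    #Go back one page of four tweets without dropping below the first page
    ip.value=max(ip.value-4,0)
    show_tweets(ip.value)
button_minus.on_click(click_down)

def click_up(event):
    #Advance one page unless the current page is already the last one
    if ip.value+4<len(sel_tw):
        ip.value += 4
        show_tweets(ip.value)
button_plus.on_click(click_up)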
tw_=widgets.VBox([widgets.VBox([ta1,ta2,ta3,ta4]),
              widgets.HBox([button_minus,  button_plus])])

def print_info(event): 
    names=event["new"]
    if names is None:
        pass
    else:
        number=name0.value.split(',').index(names)
        tab1.children[0].value=country0.value.split(',')[number]
        tab1.children[1].value=name0.value.split(',')[number]
        tab1.children[2].value=angel0.value.split(',')[number]
        tab1.children[3].value=company0.value.split(',')[number]

def list_to_string(data):
    text_list=data.tolist()
    text_string=','.join(str(t) for t in text_list)
    return text_string
def show_wordcloud(data,mask_pic):
    text_list=data.tolist()
    #Join with a space so the last word of one entry does not merge with the first word of the next
    text_string=' '.join(str(t) for t in text_list)
    wordcloud = WordCloud(mask=mask_pic,background_color='white',scale=10,stopwords=STOPWORDS).generate(text_string)
    return wordcloud

def selection_changed(event):
    global sel_data
    global sel_cos
    global sel_tw
    sel_data=pd.DataFrame()
    sel_cos=[]
    sel_tw=[]
    if event["new"] is None:
        image_pro.value = open("Default/Africa_Map.jpg", "rb").read()
        image_high.value = open("Default/Africa_Map.jpg", "rb").read()
        image_mark.value = open("Default/Africa_Map.jpg", "rb").read()
        hist = bqplot.Hist(scales = {'sample': x_sc, 'count': y_sc})
        widgets.link((hist,"sample"),(hists,"sample"))
        ta1.value=""
        ta2.value=""
        ta3.value=""
        ta4.value=""
        x.options=[]
        country.value=""
        name.value=""
        angel.value=""
        company.value=""
        pass
    else:
        for sel_co in event["new"]:
            sel_cos.append(
                word_map_code[word_map_code["ISON3"]==int(sel_co)]["Name"].values[0])
    
    for co in sel_cos:
        for tweet in tw[tw["country"]==co]["Tweets"]:
            sel_tw.append(tweet)
    for sel_co in sel_cos:
        #DataFrame.append was removed in pandas 2.0, so concatenate instead
        sel_data=pd.concat([sel_data,data[data["Country"]==sel_co]])
    #selected countries: sel_cos
    #selected data: sel_data
    #selected twitter: sel_tw
    product_desc=show_wordcloud(data['product_desc'],africa_map)    
    product_desc.to_file('product_desc.png')
    markets=show_wordcloud(data['markets'],africa_map)
    markets.to_file('markets.png')
    high_concept=show_wordcloud(data['high_concept'],africa_map)
    high_concept.to_file('high_concept.png')
    image_pro.value = open("product_desc.png", "rb").read()
    image_mark.value = open("markets.png", "rb").read()
    image_high.value = open("high_concept.png", "rb").read()
    
    if len(sel_data)>0:
        hist=bqplot.Hist(sample = sel_data["quality"],
                        scales = {'sample': x_sc, 'count': y_sc})
        widgets.link((hist,"sample"),(hists,"sample"))
    else:
        pass
    if len(sel_tw)>=4:
        ta1.value=sel_tw[ip.value]
        ta2.value=sel_tw[ip.value+1]
        ta3.value=sel_tw[ip.value+2]
        ta4.value=sel_tw[ip.value+3]  
    #==========================================
    #Please edit here
    if len(sel_cos)==0:
        pass
    else:
        company_data = pd.DataFrame()
        for row in sel_cos:
            company_data=pd.concat([company_data,Detailed_info.loc[Detailed_info['Country'] == row]])
            #a = (company_data.name.values).tolist()
        country0.value=list_to_string(company_data['Country'])
        name0.value=list_to_string(company_data['name'])
        angel0.value=list_to_string(company_data['angellist_url'])
        company0.value=list_to_string(company_data['company_url'])
        x.options=company_data['name'].tolist()

        
        

In [9]:
#Map and Image
map_mark.observe(selection_changed,"selected")   
Pro_WK1="A product description is the marketing copy used to describe a product's "
Pro_WK2="value proposition to potential customers."
Mark_WK1="A market is one of the many varieties of systems, institutions, procedures,"
Mark_WK2="social relations and infrastructures whereby parties engage in exchange. "
High_WK1="High-concept is a type of artistic work that can be easily pitched with"
High_WK2="a succinctly stated premise."

list_widgets  = [
    widgets.VBox([image_pro, widgets.Label(Pro_WK1),widgets.Label(Pro_WK2)]),
    widgets.VBox([image_mark, widgets.Label(Mark_WK1), widgets.Label(Mark_WK2)]),
    widgets.VBox([image_high, widgets.Label(High_WK1),widgets.Label(High_WK2)]),]

tab = widgets.Tab(children=list_widgets, 
                  layout=Layout(width='50%', height='420px'))

tab.set_title(0, 'Product Description')
tab.set_title(1, 'Markets')
tab.set_title(2, 'High Concept')
#Histogram
x_sc = bqplot.LinearScale()
y_sc = bqplot.LinearScale()
x_ax = bqplot.Axis(scale = x_sc,label="Quality Category",tick_style={"font-size":8},tick_format="0f")
y_ax = bqplot.Axis(scale = y_sc, label="Quality Count",orientation="vertical",tick_style={"font-size":8})
hists = bqplot.Hist(scales = {'sample': x_sc, 'count': y_sc})
fig1 = bqplot.Figure(marks = [hists], axes = [x_ax, y_ax],title='Quality Score by Area')

#Company Detail
x.observe(print_info,names='value')

In [10]:
widgets.HBox([fig,tab])

HBox(children=(Figure(axes=[ColorAxis(orientation='vertical', scale=ColorScale(scheme='Blues'), side='right', …

In [11]:
widgets.HBox([fig,fig1])

HBox(children=(Figure(axes=[ColorAxis(orientation='vertical', scale=ColorScale(scheme='Blues'), side='right', …

In [12]:
widgets.HBox([fig,tw_])

HBox(children=(Figure(axes=[ColorAxis(orientation='vertical', scale=ColorScale(scheme='Blues'), side='right', …

In [13]:
display(widgets.VBox([fig]))
display(x)
display(tab1)

VBox(children=(Figure(axes=[ColorAxis(orientation='vertical', scale=ColorScale(scheme='Blues'), side='right', …

Select(description='Company:', options=(), value=None)

Tab(children=(Text(value=''), Text(value=''), Text(value=''), Text(value='')), _titles={'0': 'Country', '1': '…