# Let's look at Data Science vs. Machine Learning!

We'll use a few algorithms to investigate. But first we need the data: 150 job descriptions gathered from searches for "data science jobs", "machine learning jobs", and "jobs" (the control group). Throughout, ds stands for Data Science and ml for Machine Learning.

In [10]:
desc_file_names = ["cont_desc$(k).txt" for k=1:40]; # load files 1-40; the remaining 10 are saved for testing

control_desc = [readstring("C_group_utf\\$k") for k in desc_file_names];
ds_desc = [readstring("D_group_utf\\$k") for k in desc_file_names];
ml_desc = [readstring("M_group_utf\\$k") for k in desc_file_names];

Now we preprocess. The goal is one dict per data set that maps each word to its number of occurrences, in the form "word" => count. We first use regular expressions to split the text into arrays of words, and then build the dicts.

In [25]:
function text2Words(text)
  matchall(r"(\w+)",lowercase(replace(text, r"([\r\n,\.\(\)!;:\?/]|\ufeff)", s" ")))
end

control_words = text2Words(join(control_desc, " "));
ds_words = text2Words(join(ds_desc, " "));
ml_words = text2Words(join(ml_desc, " "));

function wordCount(words)
  worddict = Dict{String,Int64}()
  for w in words
    worddict[w]=get(worddict,w,0) + 1
  end
  worddict
end

control_dict = wordCount(control_words);
ds_dict = wordCount(ds_words);
ml_dict = wordCount(ml_words);
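To see what the two preprocessing steps produce, here is a minimal sketch on a toy sentence. The names `words_from_text` and `count_words` are hypothetical stand-ins for `text2Words` and `wordCount`, and the sketch assumes Julia 1.x, where `eachmatch` takes the place of the older `matchall`. It also skips the punctuation-replacement step, since `\w+` already ignores punctuation.

```julia
# Simplified stand-ins for text2Words and wordCount (hypothetical names),
# written for Julia 1.x: eachmatch replaces the older matchall.
function words_from_text(text)
    # \w+ picks out runs of word characters, so punctuation is skipped.
    return [String(m.match) for m in eachmatch(r"\w+", lowercase(text))]
end

function count_words(words)
    counts = Dict{String,Int}()
    for w in words
        # get returns 0 for a word we have not seen yet
        counts[w] = get(counts, w, 0) + 1
    end
    return counts
end

toy_text = "Data science, data engineering; Machine Learning!"
toy_words = words_from_text(toy_text)
toy_counts = count_words(toy_words)

println(toy_counts["data"])  # prints 2
```

Lowercasing before matching is what lets "Data" and "data" land in the same dict entry.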



So, for example, how many times does "data" occur in the data science descriptions?

In [26]:
ds_dict["data"]

400

## WordNorm Comparison to Control

Okay, now let's get to our first algorithm. WordNorm comes in several flavors, as outlined in the project description. Essentially, though, they are all variants of $$WN_{\alpha\beta}(w) = \frac{\alpha(w)-\beta(w)}{\alpha(w)+\beta(w)}$$ where $\alpha(w)$ and $\beta(w)$ are the counts of the word $w$ in the two samples being compared. Here we'll start with the version for comparison to the control, which adds 1 to each word's control count.
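As a worked instance of the control variant, take hypothetical counts (illustrative numbers, not read from the data). Writing $s$ for a word's count in the sample and $c$ for its count in the control, both symbols introduced here for convenience:

$$WN(w) = \frac{s-(c+1)}{s+(c+1)}, \qquad s = 33,\ c = 0 \;\Rightarrow\; \frac{33-1}{33+1} = \frac{32}{34} \approx 0.941$$

Because $s \geq 1$ for every word in the sample and $c+1 \geq 1$, the denominator is always positive and the score lies strictly between $-1$ and $1$. Scores near $1$ mean the word is far more common in the sample than in the control.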

In [27]:
function wordNormCont(sample, control)
  norm_sample = Dict{String,Float64}()
  for k in keys(sample)
    c = get(control,k,0)+1
    norm_sample[k] = (sample[k]-c)/(sample[k]+c)
  end
  norm_sample
end

wn_ds = wordNormCont(ds_dict, control_dict);
wn_ml = wordNormCont(ml_dict, control_dict);



Let's get the top 10 words in each category, with their scores!

In [35]:
function maxWords(sample, num_words)
  #collect the dict into an array of pairs, sort by score (descending), keep the first num_words
  sort(collect(sample), by=last, rev=true)[1:num_words]
end



maxWords (generic function with 1 method)
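The ranking step can be tried on a small made-up dict. This is a sketch only: `top_words` is a hypothetical variant of `maxWords` that also guards against asking for more words than the dict holds, and the scores are invented for illustration.

```julia
# Toy scores; the words and values are made up for illustration.
scores = Dict("alpha" => 0.2, "beta" => 0.9, "gamma" => -0.5, "delta" => 0.6)

# Same idea as maxWords: collect the Dict into a vector of pairs,
# then sort by the value (the last element of each pair), descending.
function top_words(sample, num_words)
    ranked = sort(collect(sample), by = last, rev = true)
    # min(...) avoids a BoundsError when the dict has fewer entries
    return ranked[1:min(num_words, length(ranked))]
end

top2 = top_words(scores, 2)
println(top2)
```

Sorting the collected pairs is needed because a `Dict` has no guaranteed iteration order.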



In [36]:
maxWords(wn_ds, 10)

10-element Array{Pair{String,Float64},1}:
 "quantitative"=>0.941176 
 "r"=>0.935484            
 "statistical"=>0.935484  
 "statistics"=>0.931034   
 "predictive"=>0.916667   
 "python"=>0.9            
 "visualization"=>0.888889
 "spark"=>0.882353        
 "academy"=>0.875         
 "hadoop"=>0.875          

In [37]:
maxWords(wn_ml,10)

10-element Array{Pair{String,Float64},1}:
 "learning"=>0.941176    
 "machine"=>0.941176     
 "ai"=>0.923077          
 "algorithms"=>0.918367  
 "ml"=>0.916667          
 "r"=>0.913043           
 "spark"=>0.909091       
 "python"=>0.902439      
 "quantitative"=>0.888889
 "hadoop"=>0.882353      

Now we plot:

In [42]:
using Plots

function barWordNormCtrl(data_norm, ml_norm)
  #arrays of pairs word => WordNorm_ds/ml(word) for top 10 ds/ml words in order
  ds_top = maxWords(data_norm, 10)
  ml_top = maxWords(ml_norm, 10)

  #arrays of WordNorm_ds/ml(word) for top 10 ds/ml words in order
  ds_top_num = [ds_top[k][2] for k=1:10]
  ml_top_num = [ml_top[k][2] for k=1:10]

  #arrays of word for top 10 ds/ml words in order
  ds_top_word = [ds_top[k][1] for k=1:10]
  ml_top_word = [ml_top[k][1] for k=1:10]

  #arrays of WordNorm_ml/ds(word) for top 10 ds/ml words in order
  ds_top_ml_num = [get(ml_norm,k,-1) for k in ds_top_word]
  ml_top_ds_num = [get(data_norm,k,-1) for k in ml_top_word]

  m="Machine Learning"
  d="Data Science"

  bar([1:10 1:10 0.85:9.85 0.85:9.85], [ds_top_ml_num ml_top_ds_num ds_top_num ml_top_num],
        label=[nothing "$d WordNorm" nothing "$m WordNorm"], legend=[false true],
        bar_width = 0.8, layout = 2, ylims = (-1,1),
        title=["WordNorm for Top 10 $d Words" "WordNorm for Top 10 $m Words"],
        left_margin = 2*mm, right_margin=2*mm, top_margin=2*mm, bottom_margin=5*mm,
        xrotation = rad2deg(pi/3), size = (1000,500),
        xticks = [(1:10, ds_top_word) (1:10, ml_top_word)],
        color = [:orange :blue :blue :orange])
end

barWordNormCtrl(wn_ds, wn_ml)





We see a good bit of overlap, but also some differences that make sense. "visualization", for example, ranks among the top Data Science words but not among the top Machine Learning words. "ai" is the clear comparative winner for Machine Learning. "academy" required a deeper look: it turns out to appear many times in one specific Data Science job description, so we should probably ignore it.

## WordNorm: Comparing DS to ML Directly

Now, rather than comparing DS and ML to the control individually, we compare them to each other directly. This involves modifying WordNorm by adding $\frac{1}{2}$ to each word count in both samples.
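The effect of the half-count adjustment can be worked out in one line. Writing $s$ and $c$ for a word's raw counts in the two samples (symbols introduced here for convenience):

$$\frac{\left(s+\tfrac{1}{2}\right)-\left(c+\tfrac{1}{2}\right)}{\left(s+\tfrac{1}{2}\right)+\left(c+\tfrac{1}{2}\right)} = \frac{s-c}{s+c+1}$$

The two halves cancel in the numerator and add up to 1 in the denominator, which is the expression `compareNorm` evaluates. Swapping the samples only flips the sign, so a word shared by both samples gets opposite scores in the two directions.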

In [43]:
function compareNorm(sample, csample)
  comp = Dict{String,Float64}()
  for k in keys(sample)
    c = get(csample,k,0)
    comp[k] = (sample[k]-c)/(sample[k]+c+1)
  end
  comp
end

mlvsds = compareNorm(ml_dict, ds_dict);
dsvsml = compareNorm(ds_dict, ml_dict);
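A quick sanity check on toy counts shows how the direct comparison behaves. This is a minimal sketch: `compare_norm` is a hypothetical restatement of `compareNorm`, and the dicts are invented for illustration.

```julia
# Hypothetical restatement of compareNorm for a toy check.
function compare_norm(sample, csample)
    comp = Dict{String,Float64}()
    for (word, s) in sample
        c = get(csample, word, 0)      # 0 if the word is absent from the other sample
        comp[word] = (s - c) / (s + c + 1)
    end
    return comp
end

# Made-up counts, not taken from the job descriptions.
ml_toy = Dict("gpu" => 6, "python" => 10)
ds_toy = Dict("python" => 10, "dashboard" => 4)

ml_vs_ds = compare_norm(ml_toy, ds_toy)
ds_vs_ml = compare_norm(ds_toy, ml_toy)

println(ml_vs_ds["python"])     # prints 0.0: equal counts cancel out
println(ds_vs_ml["dashboard"])  # prints 0.8: 4 / (4 + 0 + 1)
```

Note that a word appearing $n$ times in one sample and never in the other scores $n/(n+1)$, so words exclusive to a handful of postings, such as company names, can rank very high.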

In [44]:
maxWords(mlvsds, 10)

10-element Array{Pair{String,Float64},1}:
 "ai"=>0.961538        
 "speech"=>0.933333    
 "coupa"=>0.916667     
 "tensorflow"=>0.909091
 "samsung"=>0.909091   
 "ford"=>0.888889      
 "ml"=>0.88            
 "caffe"=>0.875        
 "gpu"=>0.857143       
 "extraction"=>0.857143

In [45]:
maxWords(dsvsml, 10)

10-element Array{Pair{String,Float64},1}:
 "nielsen"=>0.967742 
 "academy"=>0.9375   
 "aig"=>0.928571     
 "director"=>0.928571
 "office"=>0.923077  
 "credit"=>0.916667  
 "tivo"=>0.909091    
 "its"=>0.909091     
 "integral"=>0.9     
 "consumers"=>0.9    