<h1>Deep Fin: Knowledge Graphs</h1>

<h2>Agenda</h2>
<ul>
  <li>A perspective on the journey towards a data-driven enterprise</li>
<li>Tapping into actionable insights that exist in unstructured text</li>
<li>Moving towards derived knowledge</li>
<li>Architectural capabilities to enable a domain-specific knowledge graph</li>
</ul>

<h1>1: Introduction</h1>

Enterprises' ambitions to become data-driven organizations:
<ul>
  <li>Analytics is broader than BI and predictive modelling</li>
<li>Text analytics brings unstructured information into the equation</li>
<li>This mix of information can involve many layers of complexity and interconnected relationships, so it does not fit easily into a structured database or data warehouse</li>
</ul>

|Question| Data Analysis Technique|
| :- | :--- |
|Given set of inputs, predict asset price direction?| Support Vector Classifier, Logistic Regression, Lasso Regression, etc.|
|How will a sharp move in one asset affect other assets?| Impulse Response Function, Granger Causality|
|Is an asset diverging from other related assets?|  One-vs-rest classification|
|Which assets move together?| Affinity Propagation, Manifold Embedding|
|What factors are driving asset price? Is the asset move excessive, and will it revert?| Principal Component Analysis, Independent Component Analysis|
|What is the current market regime?| Soft-max classification, Hidden Markov Model|
|What is the probability of an event?| Decision Tree, Random Forest|
|What are the most common signs of market stress? |K-means clustering|
|Find signals in noisy data| Low-pass filters, SVM|
|Predict volatility based on a large number of input variables| Restricted Boltzmann Machine, SVM|
|What is the sentiment of an article / text? |Bag of words|
|What is the topic of an article/text? |Term Frequency/Inverse Document Frequency (TF-IDF)|
|Counting objects in an image (satellite, drone, etc)| Convolutional Neural Nets|
|What should be optimal execution speed?| Reinforcement Learning using Partially Observed Markov Decision Process|

*source: [JPMorgan Big Data & AI Strategies in investment banking](https://jpmcsso.jpmorgan.com/sso/action/federateLogin?URI=https%3a%2f%2fmarkets.jpmorgan.com%2fcontainer-web%2fua%3fsourceUrl%3dhttps%3a%2f%2fmarkets.jpmorgan.com%2fresearch%2fcontent%2fGPS-2345119-0%253f&msg=+&securityLevel=0&multiDomainRequiredCheck=true&app=585580&ref=585579&cs=awfEaC1gi3DK%2fu3BQpyLX9jZfw0%3d)*<br/>
<img src="https://s3.amazonaws.com/com.ravenpack.cms/pages/jp-morgan-big-data-ai-5467A190.jpg"/>
    
<h2>1.1: Today's reality</h2>
<img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/data_reality.PNG" />

<h2>1.2: The Proposed Journey </h2>
<img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/Semantic_Data_Lake.PNG" />

<table border="0" align="left">
  <tr><td colspan="2"><h2>1.3: Advantages of Using Graphs</h2></td></tr>
  <tr>
<td><ul>
<li>Graphs model and store information in a way that fits how we as humans derive knowledge</li>
<li>Graphs serve as a universal meta-language to link information from structured and unstructured data</li>
<li>Graphs open the door to better-aligned data management (informal text & unstructured data)</li>
<li>Graph-based semantic models can also be understood by subject matter / domain experts</li>
  </ul></td>
  <td><img src="https://qph.fs.quoracdn.net/main-qimg-466db67bbc51c729b9d9f13520b63e67"/></td>
  </tr></table>
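
To make the "universal meta-language" point concrete, here is a minimal sketch in plain Python. It assumes facts are stored as (subject, predicate, object) triples; the node names and predicates (`article:1`, `published_by`, `mentions`) are hypothetical and chosen only for illustration. Two triples stand in for a structured table row and two for entities found in free text, and both kinds are queried the same way.

```python
from collections import defaultdict

# Hypothetical triples: the first two could come from a structured table row,
# the last two from entities extracted out of free text.
triples = [
    ("article:1", "published_by", "Los Angeles Times"),
    ("article:1", "has_category", "business"),
    ("article:1", "mentions", "Fed"),
    ("article:2", "mentions", "Fed"),
]

def build_index(triples):
    """Index triples by subject and by object so edges can be followed both ways."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for subj, pred, obj in triples:
        outgoing[subj].append((pred, obj))
        incoming[obj].append((pred, subj))
    return outgoing, incoming

outgoing, incoming = build_index(triples)

# Which other articles mention an entity that article:1 also mentions?
related = {
    subj
    for pred, obj in outgoing["article:1"] if pred == "mentions"
    for pred2, subj in incoming[obj] if pred2 == "mentions" and subj != "article:1"
}
print(related)
```

The same two-hop lookup works regardless of whether an edge originated from a database column or from text analytics, which is the property the bullets above describe.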

<h1>2: Types of Semantic Knowledge Graphs</h1>

| * | * |
| :- | :--- |
|**Web Knowledge Graph**| **Social Graphs** |
|*DBpedia, Freebase, Bing, etc.*|*Facebook, LinkedIn, etc.*|
|**Enterprise Collaboration Graphs**|**Industry Specific Graphs**|
|*Office Graph*|*UMLS, SNOMED CT, Custom*|


<h2>2.1: Common Example</h2>
<h3>Demo: <a href="http://www.bing.com" target="_blank">Bing.com</a></h3>
<img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/Example_Knowledge_Graph.PNG" />

<h1>3: Building a Semantic Knowledge Graph</h1>

<h2>3.1: Why Formulate as a Graph</h2>
<table>
  <tr>
    
    <td>
      <h4>Connections are often more insightful than the data itself</h4>
      Reality: real-world data is increasing in volume, velocity and variety, with more external data signals. RDBMS approaches do not cope well with a growing number of relationships, and the relationships, which are often the more valuable part, grow at a faster rate as you expand to capture additional signals.<br/>  
    </td>
    <td><img src="https://lucidchart.zendesk.com/hc/article_attachments/360001070323/Entity_Relationship_Diagram_-_New_Page.png" width="70%" /></td>
  </tr>      
</table>

<h2>3.2: What about RDBMS and ANSI SQL</h2>
<i>source: sql2gremlin.com</i><br/>
<h3>Recommendation</h3>
This sample shows how to recommend 5 products for a specific customer. The products are chosen as follows:
<ul>
<li>determine what the customer has already ordered</li>
<li>determine who else ordered the same products</li>
<li>determine what others also ordered</li>
<li>determine products which were not already ordered by the initial customer, but ordered by the others</li>
<li>rank products by occurrence in other orders</li>
</ul>
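
Before reading the SQL and Gremlin versions, it may help to see the five steps above as a plain-Python sketch. The `orders` mapping and the product names are hypothetical sample data, not taken from the Northwind database used by the queries. The sketch also simplifies the ranking: it counts each other customer once per product, whereas the queries may weight repeated co-purchases more heavily.

```python
from collections import Counter

# Hypothetical sample data: customer id -> set of products that customer ordered.
orders = {
    "ALFKI": {"tea", "coffee"},
    "BONAP": {"tea", "sugar", "milk"},
    "ERNSH": {"coffee", "sugar"},
    "FOLKO": {"honey"},
}

def recommend(orders, customer, k=5):
    """Recommend up to k products bought by customers with overlapping orders."""
    already = orders[customer]                  # step 1: what the customer ordered
    counts = Counter()
    for other, products in orders.items():
        if other == customer or not (products & already):
            continue                            # step 2: keep only customers sharing a product
        counts.update(products - already)       # steps 3-4: their products, minus known ones
    return [product for product, _ in counts.most_common(k)]  # step 5: rank by frequency

print(recommend(orders, "ALFKI"))
```

With this sample data, `sugar` ranks first because two overlapping customers ordered it, and `honey` is never suggested because its only buyer shares nothing with `ALFKI`.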

<div class="paragraph"><p><strong>SQL</strong></p></div>
<div class="listingblock">
<div class="content">
<pre><tt>  <span style="font-weight: bold"><span style="color: #0000FF">SELECT</span></span> TOP <span style="color: #990000">(</span><span style="color: #993399">5</span><span style="color: #990000">)</span> <span style="color: #990000">[</span>t14<span style="color: #990000">].[</span>ProductName<span style="color: #990000">]</span>
    <span style="font-weight: bold"><span style="color: #0000FF">FROM</span></span> <span style="color: #990000">(</span><span style="font-weight: bold"><span style="color: #0000FF">SELECT</span></span> COUNT<span style="color: #990000">(*)</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>value<span style="color: #990000">],</span>
                 <span style="color: #990000">[</span>t13<span style="color: #990000">].[</span>ProductName<span style="color: #990000">]</span>
            <span style="font-weight: bold"><span style="color: #0000FF">FROM</span></span> <span style="color: #990000">[</span>customers<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t0<span style="color: #990000">]</span>
     <span style="font-weight: bold"><span style="color: #0000FF">CROSS</span></span> APPLY <span style="color: #990000">(</span><span style="font-weight: bold"><span style="color: #0000FF">SELECT</span></span> <span style="color: #990000">[</span>t9<span style="color: #990000">].[</span>ProductName<span style="color: #990000">]</span>
                    <span style="font-weight: bold"><span style="color: #0000FF">FROM</span></span> <span style="color: #990000">[</span>orders<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t1<span style="color: #990000">]</span>
              <span style="font-weight: bold"><span style="color: #0000FF">CROSS</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span><span style="font-weight: bold"><span style="color: #0000FF">order</span></span> details<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t2<span style="color: #990000">]</span>
              <span style="font-weight: bold"><span style="color: #0000FF">INNER</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span>products<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t3<span style="color: #990000">]</span>
                      <span style="font-weight: bold"><span style="color: #0000FF">ON</span></span> <span style="color: #990000">[</span>t3<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t2<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span>
              <span style="font-weight: bold"><span style="color: #0000FF">CROSS</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span><span style="font-weight: bold"><span style="color: #0000FF">order</span></span> details<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t4<span style="color: #990000">]</span>
              <span style="font-weight: bold"><span style="color: #0000FF">INNER</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span>orders<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t5<span style="color: #990000">]</span>
                      <span style="font-weight: bold"><span style="color: #0000FF">ON</span></span> <span style="color: #990000">[</span>t5<span style="color: #990000">].[</span>OrderID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t4<span style="color: #990000">].[</span>OrderID<span style="color: #990000">]</span>
               <span style="font-weight: bold"><span style="color: #0000FF">LEFT</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span>customers<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t6<span style="color: #990000">]</span>
                      <span style="font-weight: bold"><span style="color: #0000FF">ON</span></span> <span style="color: #990000">[</span>t6<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t5<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span>
              <span style="font-weight: bold"><span style="color: #0000FF">CROSS</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">([</span>orders<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t7<span style="color: #990000">]</span>
                          <span style="font-weight: bold"><span style="color: #0000FF">CROSS</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span><span style="font-weight: bold"><span style="color: #0000FF">order</span></span> details<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t8<span style="color: #990000">]</span>
                          <span style="font-weight: bold"><span style="color: #0000FF">INNER</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span>products<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t9<span style="color: #990000">]</span>
                                  <span style="font-weight: bold"><span style="color: #0000FF">ON</span></span> <span style="color: #990000">[</span>t9<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t8<span style="color: #990000">].[</span>ProductID<span style="color: #990000">])</span>
                   <span style="font-weight: bold"><span style="color: #0000FF">WHERE</span></span> <span style="font-weight: bold"><span style="color: #0000FF">NOT</span></span> <span style="font-weight: bold"><span style="color: #0000FF">EXISTS</span></span><span style="color: #990000">(</span><span style="font-weight: bold"><span style="color: #0000FF">SELECT</span></span> <span style="font-weight: bold"><span style="color: #0000FF">NULL</span></span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>EMPTY<span style="color: #990000">]</span>
                                      <span style="font-weight: bold"><span style="color: #0000FF">FROM</span></span> <span style="color: #990000">[</span>orders<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t10<span style="color: #990000">]</span>
                                <span style="font-weight: bold"><span style="color: #0000FF">CROSS</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span><span style="font-weight: bold"><span style="color: #0000FF">order</span></span> details<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t11<span style="color: #990000">]</span>
                                <span style="font-weight: bold"><span style="color: #0000FF">INNER</span></span> <span style="font-weight: bold"><span style="color: #0000FF">JOIN</span></span> <span style="color: #990000">[</span>products<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t12<span style="color: #990000">]</span>
                                        <span style="font-weight: bold"><span style="color: #0000FF">ON</span></span> <span style="color: #990000">[</span>t12<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t11<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span>
                                     <span style="font-weight: bold"><span style="color: #0000FF">WHERE</span></span> <span style="color: #990000">[</span>t9<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t12<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span>
                                       <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t10<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t0<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span>
                                       <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t11<span style="color: #990000">].[</span>OrderID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t10<span style="color: #990000">].[</span>OrderID<span style="color: #990000">])</span>
                     <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t6<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span> <span style="color: #990000">&lt;&gt;</span> <span style="color: #990000">[</span>t0<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span>
                     <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t1<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t0<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span>
                     <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t2<span style="color: #990000">].[</span>OrderID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t1<span style="color: #990000">].[</span>OrderID<span style="color: #990000">]</span>
                     <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t4<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t3<span style="color: #990000">].[</span>ProductID<span style="color: #990000">]</span>
                     <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t7<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t6<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span>
                     <span style="font-weight: bold"><span style="color: #0000FF">AND</span></span> <span style="color: #990000">[</span>t8<span style="color: #990000">].[</span>OrderID<span style="color: #990000">]</span> <span style="color: #990000">=</span> <span style="color: #990000">[</span>t7<span style="color: #990000">].[</span>OrderID<span style="color: #990000">])</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t13<span style="color: #990000">]</span>
           <span style="font-weight: bold"><span style="color: #0000FF">WHERE</span></span> <span style="color: #990000">[</span>t0<span style="color: #990000">].[</span>CustomerID<span style="color: #990000">]</span> <span style="color: #990000">=</span> N<span style="color: #FF0000">'ALFKI'</span>
        <span style="font-weight: bold"><span style="color: #0000FF">GROUP</span></span> <span style="font-weight: bold"><span style="color: #0000FF">BY</span></span> <span style="color: #990000">[</span>t13<span style="color: #990000">].[</span>ProductName<span style="color: #990000">])</span> <span style="font-weight: bold"><span style="color: #0000FF">AS</span></span> <span style="color: #990000">[</span>t14<span style="color: #990000">]</span>
<span style="font-weight: bold"><span style="color: #0000FF">ORDER</span></span> <span style="font-weight: bold"><span style="color: #0000FF">BY</span></span> <span style="color: #990000">[</span>t14<span style="color: #990000">].[</span>value<span style="color: #990000">]</span> <span style="font-weight: bold"><span style="color: #0000FF">DESC</span></span></tt></pre></div></div>
<div class="paragraph"><p><strong>Gremlin</strong></p></div>
<div class="listingblock">
<div class="content">
<pre><tt>gremlin&gt; g.V().has("customer", "customerId", "ALFKI").as("customer").
               out("ordered").out("contains").out("is").aggregate("products").
               in("is").in("contains").in("ordered").where(neq("customer")).
               out("ordered").out("contains").out("is").where(without("products")).
               groupCount().order(local).by(values, decr).select(keys).limit(local, 5).
               unfold().values("name")
==&gt;Gorgonzola Telino
==&gt;Guaraná Fantástica
==&gt;Camembert Pierrot
==&gt;Chang
==&gt;Jack's New England Clam Chowder</tt></pre></div></div>
<div class="paragraph"><p><strong>References:</strong></p></div>
<div class="ulist"><ul>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#aggregate-step">Aggregate Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#as-step">As Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#groupcount-step">GroupCount Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#has-step">Has Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#limit-step">Limit Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#order-step">Order Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#select-step">Select Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#unfold-step">Unfold Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#vertex-steps">Vertex Steps</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#where-step">Where Step</a>
</p>
</li>
<li>
<p>
<a href="http://tinkerpop.apache.org/docs/current/reference/#a-note-on-predicates">A Note on Predicates</a>
</p>
</li>
</ul></div>
</div>
</div>
</div>
</div>


<h2>3.3: Proposed Reference Architecture </h2>
<img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/Proposed_Reference_Architecture.PNG" />

<h2>3.4: Data Source </h2>
<table>
  <tr><td><img src="https://archive.ics.uci.edu/ml/assets/logo.gif"/>


        <table border=1 cellpadding=6>
	<tr>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Data Set Characteristics:&nbsp;&nbsp;</b></p></td>
		<td><p class="normal">Multivariate</p></td>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Number of Instances:</b></p></td>
		<td><p class="normal">422937</p></td>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Area:</b></p></td>
		<td><p class="normal">N/A</p></td>
	</tr>

	<tr>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Attribute Characteristics:</b></p></td>
		<td><p class="normal">N/A</p></td>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Number of Attributes:</b></p></td>
		<td><p class="normal">5</p></td>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Date Donated</b></p></td>
		<td><p class="normal">2016-02-28</p></td>
	</tr>
	<tr>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Associated Tasks:</b></p></td>
		<td><p class="normal">Classification, Clustering</p></td>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Missing Values?</b></p></td>
		<td><p class="normal">N/A</p></td>
		<td bgcolor="#DDEEFF"><p class="normal"><b>Number of Web Hits:</b></p></td>
		<td><p class="normal">64924</p></td>
	</tr>
	<!--
	<tr>

		<td bgcolor="#DDEEFF"><p class="normal"><b>Highest Percentage Achieved:&nbsp;&nbsp;</b></p></td>
		<td><p class="normal">N/A</p></td>
	</tr>
	-->
</table>  
    <p class="small-heading"><b>Data Set Information:</b></p>
<p class="normal">News items are grouped into clusters that represent pages discussing the same news story. 
<br>The dataset also includes references to web pages that, at access time, pointed to (had a link to) one of the news pages in the collection.
<br>
<br>The 422937 news pages are divided up into:
<br>
<br>152746 	news of entertainment category
<br>108465 	news of science and technology category
<br>115920 	news of business category
<br> 45615 	news of health category
<br>
<br>2076 clusters of similar news for entertainment category
<br>1789 clusters of similar news for science and technology category
<br>2019 clusters of similar news for business category
<br>1347 clusters of similar news for health category
<br>
<br>References to web pages containing a link to one of the news pages in the collection are also included. They are represented as pairs of URLs corresponding to 2-page browsing sessions. The collection includes 15516 2-page browsing sessions covering 946 distinct clusters, divided up into:
<br>
<br>6091 2-page sessions for business category
<br>9425 2-page sessions for entertainment category</p>
    </td></tr>
</table>

In [6]:
%sql 
SELECT *  FROM uci_news_aggregator LIMIT 5

ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
1,"Fed official says weak data caused by weather, should not slow taper","http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\?track=rss",Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
2,Fed's Charles Plosser sees high bar for change in pace of tapering,http://www.livemint.com/Politics/H2EvwJSK2VE6OF7iK1g3PP/Feds-Charles-Plosser-sees-high-bar-for-change-in-pace-of-ta.html,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
3,US open: Stocks fall after Fed official hints at accelerated tapering,http://www.ifamagazine.com/news/us-open-stocks-fall-after-fed-official-hints-at-accelerated-tapering-294436,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
4,"Fed risks falling 'behind the curve', Charles Plosser says",http://www.ifamagazine.com/news/fed-risks-falling-behind-the-curve-charles-plosser-says-294430,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
5,Fed's Plosser: Nasty Weather Has Curbed Job Growth,http://www.moneynews.com/Economy/federal-reserve-charles-plosser-weather-job-growth/2014/03/10/id/557011,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027
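
The rows above can be read outside Spark as well. The following is a minimal sketch using only the Python standard library, with two of the sample rows copied verbatim. It assumes that `TIMESTAMP` is milliseconds since the Unix epoch (the 13-digit values suggest this) and that the single-letter `CATEGORY` codes follow the UCI dataset description; the derived column names `CATEGORY_NAME` and `PUBLISHED_AT` are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# Header plus two of the sample rows shown above, copied verbatim.
SAMPLE = """ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
2,Fed's Charles Plosser sees high bar for change in pace of tapering,http://www.livemint.com/Politics/H2EvwJSK2VE6OF7iK1g3PP/Feds-Charles-Plosser-sees-high-bar-for-change-in-pace-of-ta.html,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
4,"Fed risks falling 'behind the curve', Charles Plosser says",http://www.ifamagazine.com/news/fed-risks-falling-behind-the-curve-charles-plosser-says-294430,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
"""

# Category codes as given in the UCI dataset description.
CATEGORY_NAMES = {
    "b": "business",
    "t": "science and technology",
    "e": "entertainment",
    "m": "health",
}

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    row["CATEGORY_NAME"] = CATEGORY_NAMES.get(row["CATEGORY"], "unknown")
    # Assumption: TIMESTAMP is in milliseconds, so divide by 1000 for seconds.
    row["PUBLISHED_AT"] = datetime.fromtimestamp(int(row["TIMESTAMP"]) / 1000, tz=timezone.utc)
    rows.append(row)

for row in rows:
    print(row["ID"], row["CATEGORY_NAME"], row["PUBLISHED_AT"].date(), row["TITLE"])
```

Note that both rows share the same `STORY` value, which is how the dataset marks pages that belong to one news cluster.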


<h2>3.5: Industrial-Strength Natural Language Processing</h2>
<table><tr>
<td align="left"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" width="300pt" /></td>
<td>
<div class="o-grid__col o-grid__col--third"><h2 class="u-heading-2 u-heading">Features</h2><ul class="c-list c-list--bullets o-block u-text"><li class="c-list__item">Non-destructive <strong>tokenization</strong></li><li class="c-list__item"><strong>Named entity</strong> recognition</li><li class="c-list__item">Support for <strong>34+ languages</strong></li><li class="c-list__item"><strong>13 statistical models</strong> for 8 languages</li><li class="c-list__item">Pre-trained <strong>word vectors</strong></li><li class="c-list__item">Easy <strong>deep learning</strong> integration</li><li class="c-list__item">Part-of-speech tagging</li><li class="c-list__item">Labelled dependency parsing</li><li class="c-list__item">Syntax-driven sentence segmentation</li><li class="c-list__item">Built in <strong>visualizers</strong> for syntax and NER</li><li class="c-list__item">Convenient string-to-hash mapping</li><li class="c-list__item">Export to numpy data arrays</li><li class="c-list__item">Efficient binary serialization</li><li class="c-list__item">Easy <strong>model packaging</strong> and deployment</li><li class="c-list__item">State-of-the-art speed</li><li class="c-list__item">Robust, rigorously evaluated accuracy</li></ul></div></td>
</tr>
</table>

In [8]:
import os
import pandas as pd
import spacy 
import requests
import operator
import uuid
import numpy as np
from collections import Counter 

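# Load the small English model, downloading it on first use.
# The "en" shortcut only resolves on spaCy 2.x; the full package name
# "en_core_web_sm" works on both 2.x and 3.x.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")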

<h2>3.6: Spark User Defined Functions & Processing Pipelines</h2>

<h3>3.6.1: User Defined Functions</h3>
>User-Defined Functions (aka UDF) is a feature of Spark SQL to define new Column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets.

<h3>3.6.2: Processing Pipeline</h3>
<img src="https://spacy.io/assets/img/pipeline.svg" />

<h4>Part of Speech Tagging</h4>
e.g. ***Apple is looking at buying U.K. startup for $1 billion***

|**Token:**|Apple|is|looking|at|buying|U.K.|startup|for|$|1|billion|
| :--| :--| :--| :--| :--| :--| :--| :--| :--| :--| :--| :--|
|***Part Of Speech:***|PROPN|VERB|VERB|ADP|VERB|PROPN|NOUN|ADP|SYM|NUM|NUM|
|***Spacy TAG:***|NNP|VBZ|VBG|IN|VBG|NNP|NN|IN|$|CD|CD|
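
The fine-grained tags in the last row are what the `get_pos_chain` UDF later in this notebook joins into a single string. The sketch below reproduces that join in plain Python from the tags in the table, and adds a hypothetical helper, `has_tag_sequence`, that matches a pattern on whole tags. The helper is not part of the notebook; it is shown because a plain substring test such as `"NNP-VBZ-NNP" in chain` would also accept a chain containing `NNP-VBZ-NNPS`.

```python
# Fine-grained tags copied from the table above.
tags = ["NNP", "VBZ", "VBG", "IN", "VBG", "NNP", "NN", "IN", "$", "CD", "CD"]
chain = "-".join(tags)
print(chain)

def has_tag_sequence(chain, pattern):
    """Return True if `pattern` occurs in `chain`, compared tag by tag."""
    chain_tags, pattern_tags = chain.split("-"), pattern.split("-")
    n = len(pattern_tags)
    return any(chain_tags[i:i + n] == pattern_tags
               for i in range(len(chain_tags) - n + 1))

print(has_tag_sequence(chain, "NNP-VBZ-VBG"))           # whole-tag match
print("NNP-VBZ-NNP" in "NNP-VBZ-NNPS")                  # substring test accepts a partial tag
print(has_tag_sequence("NNP-VBZ-NNPS", "NNP-VBZ-NNP"))  # whole-tag test does not
```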

<h4>Named Entity Recognition</h4>
>A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
***Source***: [spaCy](http://spacy.io)

|Text|Start|End|Label|Description|
| :--| :--|:--|:--|:--|
|Apple|0|5|ORG|Companies, agencies, institutions.|
|U.K.|27|31|GPE|Geopolitical entity, i.e. countries, cities, states.|
|$1 billion|44|54|MONEY|Monetary values, including unit.|
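
The Start and End columns are character offsets into the original sentence, with End exclusive, so they line up with Python slicing. The short check below uses only the values from the table and needs no spaCy installation; the variable names are illustrative.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start, end, label) triples copied from the table above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing the sentence with each offset pair recovers the entity text.
spans = [text[start:end] for start, end, _ in entities]
for span, (_, _, label) in zip(spans, entities):
    print(f"{span!r} -> {label}")
```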

>To learn more about entity recognition in spaCy, how to add your own entities to a document and how to train and update the entity predictions of a model, see the usage guides on named entity recognition and training the named entity recognizer.

In [10]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

def get_entities_udf():
    def get_entities(text):
        global nlp        
        doc = nlp(str(text))
        
        return [t.text for t in doc.ents]
    res_udf = udf(get_entities, ArrayType(StringType()))
    return res_udf

def get_pos_chain_udf():
    def get_pos_chain(text):
        global nlp
        doc = nlp(str(text))

        # Join the fine-grained tags of all tokens, e.g. "NNP-VBZ-NNP"
        return "-".join([d.tag_ for d in doc])
    res_udf = udf(get_pos_chain, StringType())
    return res_udf
  

def get_verb(token, omittdescription):
    """Classify a spaCy token as a verb type; return None for non-verbs."""
    if token.pos_ != 'VERB':
        return None

    # spaCy's English dependency scheme labels indirect objects "dative"
    # and direct objects "dobj", so "dative" counts as an indirect object here.
    child_deps = {item.dep_ for item in token.children}
    indirect_object = bool(child_deps & {"iobj", "pobj", "dative"})
    direct_object = "dobj" in child_deps

    if indirect_object and direct_object:
        description = 'DITRANVERB'
    elif direct_object:
        description = 'TRANVERB'
    elif not indirect_object:
        description = 'INTRANVERB'
    else:
        description = 'VERB'

    # Return only the text, or a (description, text) pair, depending on the flag
    if omittdescription:
        return token.text
    return (description, token.text)
            
def get_verbs_udf():
  def get_verbs(title_text,omittdescription=True):    
    global nlp        
    doc = nlp(str(title_text))
    
    verbs = []
    for token in doc:
        verb = get_verb(token, omittdescription)
        if verb is not None:
            verbs.append(verb)
    return verbs
  
  res_udf = udf(get_verbs, ArrayType(StringType()))
  return res_udf

def get_concept_probability(entity_name):
    """Return [concept, probability] for the most likely concept of an entity, or None."""
    # Pass the entity via `params` so names with spaces or symbols are URL-encoded
    response = requests.get('https://concept.research.microsoft.com/api/Concept/ScoreByProb',
                            params={'instance': entity_name, 'topK': 5},
                            timeout=10)

    if response.status_code == 200:
        concept_dict = response.json()  # maps concept name -> probability
        if len(concept_dict) > 0:
            max_concept = max(concept_dict.items(), key=operator.itemgetter(1))
            return [max_concept[0], format(max_concept[1], '.2f')]

    return None
  
uuidUdf= udf(lambda : str(uuid.uuid4()),StringType()).asNondeterministic()

In [11]:
df = spark.sql("SELECT *  FROM uci_news_aggregator LIMIT 100") #LIMIT 1000
df.cache()
df.count()

<h2>4: Load and Process Data </h2>

>For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed. 
One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it got it.
For example, if you executed every transformation eagerly, what would that mean? It means you would have to materialize that many intermediate datasets in memory. This is evidently not efficient; for one, it will increase your GC costs. (You're really not interested in those intermediate results as such. They are just convenient abstractions for you while writing the program.) So what you do instead is tell Spark the eventual answer you're interested in, and it figures out the best way to get there.
***Source***: [Stack Overflow](https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage)
<table>
  <tr>
    <td><b>Distributed runtime</b></td>
    <td><b>Execution Plan</b></td>
  </tr>
  <tr>
    <td><img src="https://i.stack.imgur.com/sqke5.png"/></td>
    <td><img src="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/diagrams/stage-tasks.png"/></td></tr>
</table>
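
The lazy-evaluation idea quoted above can be illustrated without Spark. The following is a toy sketch in plain Python, not Spark's implementation: `LazyDataset` and its methods are hypothetical names. Transformations only record a plan, and the action `collect` runs the whole plan in a single pass.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, source, plan=None):
        self.source = source          # the input records
        self.plan = plan or []        # recorded transformations, in order

    def map(self, fn):
        # Transformation: only extend the plan, do no work yet
        return LazyDataset(self.source, self.plan + [("map", fn)])

    def filter(self, predicate):
        # Transformation: only extend the plan, do no work yet
        return LazyDataset(self.source, self.plan + [("filter", predicate)])

    def collect(self):
        # Action: run the whole plan record by record, so no
        # intermediate dataset is ever materialized
        results = []
        for record in self.source:
            keep = True
            for kind, fn in self.plan:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                results.append(record)
        return results


calls = []

def double(x):
    calls.append(x)   # track when the function really runs
    return x * 2

pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
work_before_action = len(calls)   # still 0: nothing has executed
result = pipeline.collect()       # the action triggers execution
```

Because the plan is known in full before anything runs, an engine is free to reorder or fuse steps; this toy version only fuses them into one loop.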

In [13]:
df = df.withColumn("title_pos_chain", get_pos_chain_udf()("TITLE"))
df = df.withColumn("named_entities", get_entities_udf()("TITLE"))
df = df.withColumn("title_verbs", get_verbs_udf()("TITLE"))

df.cache()
df.show()

In [14]:
#preview tuples 
pos_chain_1 = "NNP-VBZ-NNP"

example_events_df = (df
                .where(df["title_pos_chain"].contains(pos_chain_1))
                .filter(size(df["named_entities"]) == 2)).rdd.collect()

graphbase_list = [{'TITLE' : x["TITLE"], 'named_entities' : x["named_entities"], 'title_verbs' : x["title_verbs"]}
                  for x in example_events_df]

for r in graphbase_list:    
  print (r["TITLE"])
  print ('named entities:{} verbs:{}'.format(r["named_entities"], r["title_verbs"]))  
  print ('entity types: {}'.format([ get_concept_probability(entity_name) for entity_name in r["named_entities"]] ))
  print ()    


<table>
  <tr>
<td><img src="https://henningkropponlinede.files.wordpress.com/2016/12/spark-broadcast.png?w=640"/></td>
<td>
Spark SQL uses a broadcast join (also known as a broadcast hash join) instead of a hash join to optimize join queries when the size of one side of the join is below <code>spark.sql.autoBroadcastJoinThreshold</code>.

A broadcast join can be very efficient for joining a large (fact) table with relatively small (dimension) tables, as in a star-schema join. It avoids sending all the data of the large table over the network.
    </td></tr></table>
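
The mechanism behind a broadcast hash join can be sketched in plain Python. This is an illustration of the idea only, not Spark code: `broadcast_hash_join` is a hypothetical helper and the sample rows are made up. The small table is turned into a hash map once, and each partition of the large table is joined locally against it.

```python
def broadcast_hash_join(large_partitions, small_table, key):
    """Join each partition of a large table against a small table held in memory."""
    # "Broadcast": build one hash map of the small table; every worker gets a copy
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in large_partitions:      # each partition is joined locally
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})   # inner join on `key`
    return joined


# Made-up sample rows: a partitioned fact table and a small dimension table
articles = [
    [{"ID": 1, "PUBLISHER": "Reuters"}, {"ID": 2, "PUBLISHER": "Ocala"}],
    [{"ID": 3, "PUBLISHER": "Reuters"}, {"ID": 4, "PUBLISHER": "Unknown"}],
]
publishers = [
    {"PUBLISHER": "Reuters", "publisher_id": "p1"},
    {"PUBLISHER": "Ocala", "publisher_id": "p2"},
]

rows = broadcast_hash_join(articles, publishers, "PUBLISHER")
```

The rows of the large table never move between partitions; only the small table is copied. In PySpark, the same strategy can be requested explicitly by wrapping the small DataFrame in the `broadcast()` hint from `pyspark.sql.functions` before joining.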

In [16]:
# extract core entities
publishers = df.select("PUBLISHER").distinct().withColumn("publisher_id", uuidUdf()).select("publisher_id","PUBLISHER")
timestamps = df.select("TIMESTAMP").distinct().withColumn("Date", to_date(from_unixtime(col("TIMESTAMP") / 1000))).select("Date").distinct().withColumn("date_id", uuidUdf()).select("date_id","Date")

publishers.persist()
timestamps.persist()

In [17]:
pos_chain_1 = "NNP-VBZ-NNP"

get_entity_type_udf = udf(get_concept_probability, ArrayType(StringType()))

base_df = (df
           .where(df["title_pos_chain"].contains(pos_chain_1))
           .filter(size(df["named_entities"]) == 2)
           .withColumn("event_id", uuidUdf() )           
           .withColumn("named_entity1", col("named_entities")[0] )
           .withColumn("named_entity2", col("named_entities")[1] )
           .withColumn("named_entity1_id", uuidUdf() )
           .withColumn("named_entity2_id", uuidUdf() )
           .withColumn("named_entity1_type", get_entity_type_udf("named_entity1")[0])
           .withColumn("named_entity1_type_probability", get_entity_type_udf("named_entity1")[1])
           .withColumn("named_entity2_type", get_entity_type_udf("named_entity2")[0])
           .withColumn("named_entity2_type_probability", get_entity_type_udf("named_entity2")[1])
           .withColumn("Date", to_date(from_unixtime(col("TIMESTAMP") / 1000)))
           .select("ID","TITLE", "PUBLISHER", "URL", "Date", "event_id", "title_verbs", \
                   "named_entity1_id", "named_entity1", "named_entity1_type", "named_entity1_type_probability", \
                   "named_entity2_id", "named_entity2", "named_entity2_type", "named_entity2_type_probability")
           .where(col("named_entity1_type").isNotNull() & col("named_entity2_type").isNotNull())
          )

graphbase_df = base_df.join(publishers, ['PUBLISHER'] )
graphbase_df = graphbase_df.join(timestamps, ['Date'] )
graphbase_df.cache()
graphbase_df.printSchema()

<h2>5: Graph Model</h2>


<table>
  <tr>
    <td><img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/Knowledge_Graph_Basics.PNG" width="80%" /></td>
    <td>
      <ul>
        <li><b>Entity</b></li>

        <li><b>Predicate</b><br/>
      Relation between two connected entities</li>

        <li><b>CVT (Compound Value Type)</b><br/>
      Not a real-world entity; it is used to collect the multiple fields of an event</li>

        <li><b>Fact</b><br/>
      Triple, which connects two entities<br/>
      Event, which connects multiple entities via a CVT node</li>

      </ul>
    </td>
  </tr>
</table>
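
The difference between a simple triple and an event collected through a CVT node can be sketched as follows. This is a minimal illustration with made-up identifiers and predicate names (`published_by`, `actor`, `target`, and so on are assumptions, not the notebook's schema).

```python
from collections import namedtuple

# A fact as a triple: (subject entity, predicate, object entity)
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Simple fact: one triple connects two entities directly
simple_fact = Triple("article:66", "published_by", "publisher:Livemint")

# Event: the CVT node ("event:1") is not a real-world entity; it only
# groups the fields of one event so that several entities can be attached
event_facts = [
    Triple("event:1", "action", "rejects"),
    Triple("event:1", "actor", "entity:eBay"),
    Triple("event:1", "target", "entity:Carl Icahn"),
    Triple("event:1", "reported_in", "article:66"),
]

def entities_of_event(facts, cvt_id):
    """Collect every entity attached to a CVT node."""
    return sorted(t.object for t in facts
                  if t.subject == cvt_id and t.object.startswith("entity:"))

participants = entities_of_event(event_facts, "event:1")
```

A single triple cannot express "who did what to whom, reported where"; routing the fields through one CVT node keeps them together.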
 

<h3>5.1: Article Graph Model</h3>
<img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/Proposed_Graph.PNG" width="60%"/>


<h3>What are GraphFrames?</h3>
<p>GraphFrames support general graph processing, similar to Apache Spark&rsquo;s GraphX library. However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages:</p>
<ul><li><strong>Python, Java &amp; Scala APIs:</strong> GraphFrames provide uniform APIs for all 3 languages. For the first time, all algorithms in GraphX are available from Python &amp; Java.</li>
<li><strong>Powerful queries:</strong> GraphFrames allow users to phrase queries in the familiar, powerful APIs of <a href="https://databricks.com/glossary/what-is-spark-sql" title="Glossary: Spark SQL">Spark SQL</a> and DataFrames.</li>
<li><strong>Saving &amp; loading graphs:</strong> GraphFrames fully support <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources">DataFrame data sources</a>, allowing writing and reading graphs using many formats like <a href="https://databricks.com/glossary/what-is-parquet" title="Glossary: Parquet">Parquet</a>, JSON, and CSV.</li>
</ul><p>In GraphFrames, vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge.</p>


<h3>Standard graph algorithms</h3>
<p>GraphFrames comes with a number of standard graph algorithms built in:</p>
<div class="contents local topic" id="id1">
<ul class="simple">
<li>Breadth-first search (BFS)</li>
<li>Connected components</li>
<li>Strongly connected components</li>
<li>Label propagation</li>
<li>PageRank</li>
<li>Shortest paths</li>
<li>Triangle counting</li>
</ul>
</div>

In [19]:
from graphframes import *

#create Nodes
article_nodes = (graphbase_df            
            .withColumn("nodeType", lit("Article"))
            .withColumn("probability", lit(100))
            .select(col("ID").alias("id"),"nodeType",col("URL").alias("value"),"probability")
           )
publisher_nodes = (publishers                   
                   .withColumn("id", col("publisher_id"))
                   .withColumn("nodeType", lit("Publisher"))
                   .withColumn("probability", lit(100))
                   .select("id","nodeType",col("PUBLISHER").alias("value"),"probability")  
                  )

timestamp_nodes = (timestamps                   
                   .withColumn("id", col("date_id"))
                   .withColumn("nodeType", lit("Date"))
                   .withColumn("probability", lit(100))
                   .select("id","nodeType",col("Date").alias("value"),"probability")
                  )

named_entity1_nodes = (graphbase_df
                       .withColumn("id", col("named_entity1_id"))
                       .withColumn("nodeType", col("named_entity1_type"))
                       .select("id","nodeType",col("named_entity1").alias("value"), col("named_entity1_type_probability").alias("probability"))
                      )

named_entity2_nodes = (graphbase_df
                       .withColumn("id", col("named_entity2_id"))
                       .withColumn("nodeType", col("named_entity2_type"))
                       .select("id","nodeType",col("named_entity2").alias("value"), col("named_entity2_type_probability").alias("probability"))
                      )

CVT_event_nodes = (graphbase_df
                       .where(col("title_verbs").isNotNull())  # filter before the select drops title_verbs
                       .withColumn("id", col("event_id"))
                       .withColumn("nodeType", lit("@Event"))
                       .withColumn("probability", lit(100))
                       .select("id","nodeType",col("title_verbs").cast("string").alias("value"), "probability")
                      )

article_publisher_edges = (graphbase_df
                           .select(col("ID").alias("src"), col("publisher_id").alias("dst"),col("URL").alias("value"))
                          )

article_timestamp_edges = (graphbase_df
                           .select(col("ID").alias("src"), col("date_id").alias("dst"),col("Date").alias("value"))
                          )

event_named_entity1_edges = (graphbase_df
                             .select(col("named_entity1_id").alias("src"), col("event_id").alias("dst"), col("title_verbs").cast("string").alias("value") ) 
                            )

event_named_entity2_edges = (graphbase_df
                             .select(col("named_entity2_id").alias("src"), col("event_id").alias("dst"), col("title_verbs").cast("string").alias("value") ) 
                            )
CVT_event_edges = (graphbase_df                       
                       .where(col("title_verbs").isNotNull())  # filter before the select drops title_verbs
                       .select(col("ID").alias("src"), col("event_id").alias("dst"), col("title_verbs").cast("string").alias("value"))
                      )

named_entity_nodes = named_entity1_nodes.union(named_entity2_nodes).union(CVT_event_nodes)
event_named_entity_edges = event_named_entity1_edges.union(event_named_entity2_edges)

all_nodes = article_nodes.union(publisher_nodes).union(timestamp_nodes).union(named_entity_nodes)
all_edges = article_publisher_edges.union(article_timestamp_edges).union(event_named_entity_edges).union(CVT_event_edges)

all_nodes.write.saveAsTable("graph_nodes", mode="overwrite")
all_edges.write.saveAsTable("graph_edges", mode="overwrite")

all_nodes.cache()
all_edges.cache()

all_graph = GraphFrame(all_nodes, all_edges)

In [20]:
EDA_df = spark.sql("SELECT nodeType, COUNT(*) FROM graph_nodes GROUP BY nodeType")
display(EDA_df)

nodeType,count(1)
activist investor,3
Publisher,65
@Event,6
site,6
suspect,3
Date,1
Article,6


<h2>6: Entity Disambiguation</h2>

<p>In natural language processing, <b>named entity disambiguation</b> (NED) is the task of determining the identity of entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". </p>

<h3>Edit Distance</h3>
<p>In computational linguistics and computer science, <b>edit distance</b> is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. <br/> (source: Wikipedia)
</p>
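
To make the definition concrete, here is a minimal plain-Python sketch of the Levenshtein edit distance and of the similarity ratio `(max_length - edit_distance) / max_length` used in this section's notebook cell, which keeps pairs with a ratio of at least 0.6. The function names are illustrative; in the notebook itself the distance comes from Spark's built-in `levenshtein` column function.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_ratio(a, b):
    """(max_length - edit_distance) / max_length, on lower-cased strings."""
    a, b = a.lower(), b.lower()
    max_length = max(len(a), len(b))
    if max_length == 0:
        return 1.0   # assumption: two empty strings count as identical
    return (max_length - levenshtein(a, b)) / max_length


distance = levenshtein("kitten", "sitting")          # 3 edits
ratio = similarity_ratio("Carl Icahn", "Carl Ichan")  # 2 edits over 10 characters
```

With these made-up inputs, the misspelled name scores 0.8 and would pass the 0.6 threshold, so the two nodes would be flagged as candidates for the same entity.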

In [22]:
# entity disambiguation

standard_nodes = "'Publisher', 'Article', 'site', 'Date', '@Event'"

named_entities_df = spark.sql("SELECT * from graph_nodes WHERE nodeType not in (" + standard_nodes + ")")

named_entities_df = named_entities_df.crossJoin(named_entities_df.withColumnRenamed("value" , "value2").withColumnRenamed("id" , "id2").select("value2","id2"))

named_entities_df = (named_entities_df
                     .withColumn("value", lower(col("value")))
                     .withColumn("value2", lower(col("value2")))
                     .withColumn("len_value", length(col("value")))
                     .withColumn("len_value2", length(col("value2")))
                     .withColumn("edit_distance", levenshtein("value", "value2"))
                     .withColumn("max_length", when( (col("len_value") > col("len_value2")), \
                                     col("len_value")).otherwise(col("len_value2")))
                     .withColumn("similarity_ratio", (col("max_length") - col("edit_distance")) / col("max_length") )                     
                     .where((col("similarity_ratio") >= 0.6) & (col("id") != col("id2")) )
                    )

named_entities_df.show()                       

<h2 id="features-of-azure-cosmos-db-graph-database">7: Features of Azure Cosmos DB graph database</h2>
<p>Azure Cosmos DB is a fully managed graph database that offers global distribution, elastic scaling of storage and throughput, automatic indexing and query, tunable consistency levels, and support for the TinkerPop standard.</p>
<p><img src="https://docs.microsoft.com/en-us/azure/cosmos-db/media/graph-introduction/cosmosdb-graph-architecture.png" alt="Azure Cosmos DB graph architecture" data-linktype="relative-path"></p>


<h2 id="count-vertices-in-the-graph">Count vertices in the graph</h2>
<p>The following snippet shows how to count the number of vertices in the graph:</p>
<pre><code>g.V().count()
</code></pre><h2 id="filters">Filters</h2>
<p>You can perform filters using Gremlin&#39;s <code>has</code> and <code>hasLabel</code> steps, and combine them using <code>and</code>, <code>or</code>, and <code>not</code> to build more complex filters. Azure Cosmos DB provides schema-agnostic indexing of all properties within your vertices and edges for fast queries:</p>
<pre><code>g.V().hasLabel(&#39;person&#39;).has(&#39;age&#39;, gt(40))
</code></pre><h2 id="projection">Projection</h2>
<p>You can project certain properties in the query results using the <code>values</code> step:</p>
<pre><code>g.V().hasLabel(&#39;person&#39;).values(&#39;firstName&#39;)
</code></pre><h2 id="find-related-edges-and-vertices">Find related edges and vertices</h2>
<p>So far, we&#39;ve only seen query operators that work in any database. Graphs are fast and efficient for traversal operations when you need to navigate to related edges and vertices. Let&#39;s find all friends of Thomas. We do this by using Gremlin&#39;s <code>outE</code> step to find all the out-edges from Thomas, then traversing to the in-vertices from those edges using Gremlin&#39;s <code>inV</code> step:</p>
<pre><code class="lang-cs">g.V(&#39;thomas&#39;).outE(&#39;knows&#39;).inV().hasLabel(&#39;person&#39;)
</code></pre><p>The next query performs two hops to find all of Thomas&#39; &quot;friends of friends&quot;, by calling <code>outE</code> and <code>inV</code> two times. </p>
<pre><code class="lang-cs">g.V(&#39;thomas&#39;).outE(&#39;knows&#39;).inV().hasLabel(&#39;person&#39;).outE(&#39;knows&#39;).inV().hasLabel(&#39;person&#39;)
</code></pre><p>You can build more complex queries and implement powerful graph traversal logic using Gremlin, including mixing filter expressions, performing looping using the <code>repeat</code> step, and implementing conditional navigation using the <code>choose</code> step. Learn more about what you can do with <a href="gremlin-support" data-linktype="relative-path">Gremlin support</a>!</p>
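
To show what the chained steps above do, here is a toy in-memory imitation in plain Python. It is not a Gremlin client: the `Traversal` class, its method names and the sample graph are hypothetical, and each method only mimics the corresponding step (`outE`, `inV`, `hasLabel`) on a small edge list.

```python
class Traversal:
    """Toy, in-memory imitation of a Gremlin-style step chain (hypothetical class)."""

    def __init__(self, vertices, edges, current):
        self.vertices = vertices   # {vertex id: label}
        self.edges = edges         # list of (src, edge label, dst)
        self.current = current     # vertex ids or edges reached so far

    def out_e(self, label):
        # Like outE: move from the current vertices to their outgoing edges
        reached = [e for v in self.current for e in self.edges
                   if e[0] == v and e[1] == label]
        return Traversal(self.vertices, self.edges, reached)

    def in_v(self):
        # Like inV: move from the current edges to the vertices they point to
        return Traversal(self.vertices, self.edges, [e[2] for e in self.current])

    def has_label(self, label):
        # Like hasLabel: keep only vertices with the given label
        kept = [v for v in self.current if self.vertices[v] == label]
        return Traversal(self.vertices, self.edges, kept)

    def to_list(self):
        return list(self.current)


# Made-up sample graph
vertices = {"thomas": "person", "mary": "person", "ben": "person", "robin": "person"}
edges = [("thomas", "knows", "mary"), ("thomas", "knows", "ben"), ("mary", "knows", "robin")]

start = Traversal(vertices, edges, ["thomas"])
friends = start.out_e("knows").in_v().has_label("person").to_list()
friends_of_friends = (start.out_e("knows").in_v().has_label("person")
                           .out_e("knows").in_v().has_label("person").to_list())
```

Each step consumes the result of the previous one, which is why a two-hop query is simply the one-hop chain written twice.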

In [24]:
from urllib.parse import quote

def urlencode(value):
  return quote(value, safe="")


udf_urlencode = udf(urlencode, StringType())

def to_cosmosdb_vertices(dfVertices, labelColumn, partitionKey = ""):
  dfVertices = dfVertices.withColumn("id", udf_urlencode("id"))
  
  columns = ["id", labelColumn]
  
  if partitionKey:
    columns.append(partitionKey)
  
  columns.extend(['nvl2({x}, array(named_struct("id", uuid(), "_value", {x})), NULL) AS {x}'.format(x=x) \
                for x in dfVertices.columns if x not in columns])
 
  return dfVertices.selectExpr(*columns).withColumnRenamed(labelColumn, "label")

def to_cosmosdb_edges(g, labelColumn, partitionKey = ""): 
  dfEdges = g.edges
  
  if partitionKey:
    dfEdges = dfEdges.alias("e") \
      .join(g.vertices.alias("sv"), col("e.src") == col("sv.id")) \
      .join(g.vertices.alias("dv"), col("e.dst") == col("dv.id")) \
      .selectExpr("e.*", "sv." + partitionKey, "dv." + partitionKey + " AS _sinkPartition")

  dfEdges = dfEdges \
    .withColumn("id", udf_urlencode(concat_ws("_", col("src"), col(labelColumn), col("dst")))) \
    .withColumn("_isEdge", lit(True)) \
    .withColumn("_vertexId", udf_urlencode("src")) \
    .withColumn("_sink", udf_urlencode("dst")) \
    .withColumnRenamed(labelColumn, "label") \
    .drop("src", "dst")
  
  return dfEdges


cosmosDbVertices = to_cosmosdb_vertices(all_graph.vertices, "nodeType")
cosmosDbEdges = to_cosmosdb_edges(all_graph,"value")

display(cosmosDbVertices)
#display(cosmosDbEdges)

id,label,value,probability
52,Article,"List(List(08d7135c-06df-40c3-a0e9-ecd859935fda, http://www.news-sentinel.com/apps/pbcs.dll/article\?AID=/20140310/AP01/303109948))","List(List(0627d5b8-b446-4523-ae1c-6d715029869c, 100))"
55,Article,"List(List(51f6a6c4-36f6-4fc4-95e3-3461beb1c45a, http://www.reuters.com/article/2014/03/10/us-ebay-icahn-idUSBREA1Q1CB20140310))","List(List(f81ec312-e849-4548-9a66-e23196481abd, 100))"
66,Article,"List(List(0d85f9fd-2546-4be5-bf54-64e16e7cf32e, http://www.livemint.com/Companies/reFKQhuvTX38786ewbP6GM/EBay-rejects-Carl-Icahn-board-nominees-asks-investors-to-do.html))","List(List(6e82cfb5-339a-413a-a5ee-1b7b1e285b7c, 100))"
69,Article,"List(List(3396e3c3-7b4c-4142-ac82-6a4235910c9b, http://www.cityam.com/blog/1394460724/carl-icahn-slams-ebay-ceo))","List(List(602062f9-8479-447c-8c8d-3d540f1d48de, 100))"
73,Article,"List(List(30a8c4f5-c7dd-4759-963d-aab22a6a1325, http://www.theglobeandmail.com/report-on-business/international-business/us-business/ebay-urges-shareholders-to-vote-against-icahn-board-nominees/article17391847/))","List(List(99425757-01a7-4eb3-b803-63a87847a1bf, 100))"
77,Article,"List(List(d6cf6d1b-b0c1-4fb6-80a6-c49e022ccb85, http://www.marketwatch.com/story/ebay-says-carl-icahns-board-picks-not-qualified-2014-03-10\?link=MW_latest_news))","List(List(2f1fbc7d-7a12-46a4-b3d3-d07b2a459767, 100))"
4dbe8b72-c4b7-4102-9ae5-c77016343b14,Publisher,"List(List(cf609c53-b0a6-4172-9e06-43e1d0079ec5, Bradenton Herald))","List(List(a9e8e83b-e732-4552-9714-1bb22c765e68, 100))"
11595b42-536a-40ee-b6af-cd038720f95c,Publisher,"List(List(03a0b44e-0245-4716-b154-24810e8739e0, Ocala))","List(List(0d21b574-e48a-4432-8bfb-02211bb8638a, 100))"
1e529295-e9b1-4746-b943-2f92ab08e15a,Publisher,"List(List(1ad2b3c4-4391-4a3e-8ac6-abbba2b71e22, The Globe and Mail))","List(List(5a9ada42-1ccc-48e1-9b75-90dd41b73b4f, 100))"
72d97f22-072c-4f77-8c0f-88a9cddae884,Publisher,"List(List(8d14b144-e0a6-4c69-9758-6c4baf9e7697, IFA Magazine))","List(List(4702c070-ed03-4b24-b547-c9937846eca8, 100))"
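
The output above shows each property wrapped in a list of `(id, value)` pairs. A plain-Python sketch of the document shapes that `to_cosmosdb_vertices` and `to_cosmosdb_edges` produce may make this easier to read. The helpers `to_vertex_document` and `to_edge_document` are hypothetical, work on single dictionaries instead of DataFrames, ignore the partition key, and assume `src`, `dst` and the label are never NULL; the sample edge is made up.

```python
import uuid
from urllib.parse import quote

def to_vertex_document(row, label_column):
    """Shape one vertex row the way to_cosmosdb_vertices shapes a DataFrame row."""
    doc = {"id": quote(str(row["id"]), safe=""), "label": row[label_column]}
    for name, value in row.items():
        if name in ("id", label_column):
            continue
        # Each property becomes a list of {id, _value} entries; NULL stays NULL
        doc[name] = None if value is None else [{"id": str(uuid.uuid4()), "_value": value}]
    return doc

def to_edge_document(row, label_column):
    """Shape one edge row the way to_cosmosdb_edges does (no partition key)."""
    edge_id = "_".join(str(row[c]) for c in ("src", label_column, "dst"))
    return {
        "id": quote(edge_id, safe=""),            # URL-encoded src_label_dst
        "label": row[label_column],
        "_isEdge": True,
        "_vertexId": quote(str(row["src"]), safe=""),
        "_sink": quote(str(row["dst"]), safe=""),
    }

vertex_doc = to_vertex_document(
    {"id": "4dbe8b72-c4b7-4102-9ae5-c77016343b14", "nodeType": "Publisher",
     "value": "Bradenton Herald", "probability": 100},
    "nodeType",
)
edge_doc = to_edge_document({"src": "52", "dst": "p-1", "value": "published by"}, "value")
```

Edges are stored as documents too: `_isEdge` marks them, while `_vertexId` and `_sink` hold the source and destination vertex ids.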


In [25]:
#insert into CosmosDB

cosmosDbConfig = {
  "Endpoint" : "https://<yourgraphdb>.documents.azure.com:443/",
  "Masterkey" : "[PRIMARY KEY]",
  "Database" : "[DATABASE]",
  "Collection" : "[COLLECTION ID]",
  "Upsert" : "true"
}

cosmosDbFormat = "com.microsoft.azure.cosmosdb.spark"

cosmosDbVertices.write.format(cosmosDbFormat).mode("append").options(**cosmosDbConfig).save()
cosmosDbEdges.write.format(cosmosDbFormat).mode("append").options(**cosmosDbConfig).save()


<h1>8: Gremlin Query and Analysis </h1>

<p>Query plan: <code>g.step1().step2().step3()...stepN()</code></p>

In [27]:
#display(all_graph.vertices.filter("nodeType = 'Date'"))
display(all_graph.inDegrees.orderBy(desc("inDegree")).limit(5))

id,inDegree
66b5cc69-405b-4116-965e-38ac5208f56a,6
437278e0-9693-4827-88cd-b55e948f8ae5,3
b9737c32-e1bb-42d7-85b6-365a3e82365e,3
5be844c4-b59d-462f-8320-9716b2c05158,3
d38aa9e8-a407-4c69-a317-af941a6ff058,3
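
GraphFrames' `inDegrees` reports, for each vertex, how many edges point at it. The same count can be sketched in plain Python over `(src, dst)` pairs; the helper name `in_degrees` and the sample edges are made up for illustration.

```python
from collections import Counter

def in_degrees(edges):
    """Count incoming edges per vertex from (src, dst) pairs."""
    return Counter(dst for _src, dst in edges)

# Made-up edges: three articles point to one date node, two to one publisher
edges = [
    ("article:1", "date:2014-03-10"),
    ("article:2", "date:2014-03-10"),
    ("article:3", "date:2014-03-10"),
    ("article:1", "publisher:A"),
    ("article:2", "publisher:A"),
]

top = in_degrees(edges).most_common(2)   # highest in-degree first
```

Nodes shared by many articles, such as a common date or publisher, rise to the top of this ranking.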


<h2>9: Graph ML</h2>

<img src="https://github.com/salimngit/DeepFin-Series-JPMorgan/raw/master/images/networks_analytics.PNG" />

<h3>Beyond network analytics, looking at the knowledge graph as your features</h3>

<img src="https://cdn-images-1.medium.com/max/2000/1*KQORrLiT27fLkdZk-Pahhw.png"/>

<img src="https://cdn-images-1.medium.com/max/1600/1*hC7Sw4kbzKDEbUd9mwTVRw.png" />

(source: https://medium.com/octavian-ai/how-to-get-started-with-machine-learning-on-graphs-7f0795c83763)