# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects the named entities (NEs) with a pre-trained model, and displays them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, see Help -> About Scala Kernel.

## 1. Fetch the text

### 1.1 Import libraries

In [7]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`

In [8]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML

In [9]:
import $ivy.`org.json4s::json4s-jackson:3.4.0`
import org.json4s.JsonDSL._
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats

formats: DefaultFormats.type = org.json4s.DefaultFormats$@602878fb

In [10]:
import scala.collection.mutable.ListBuffer

### 1.2 Fetch the text of the RSS feed and the Reddit feed

We define an abstract class, `URL_R`, whose subclasses decide how a feed is parsed.
It declares a method `parser(url: String): Seq[String]`, which receives a URL and returns a list of texts holding the data found at that URL.

For RSS:

We issue an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the text of the feed in XML format. The XML is then parsed to extract the `title` and `description` fields.

For Reddit:

We issue an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the text of the feed in JSON format. The JSON is then parsed to extract the `title` and `selftext` fields.
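
As a rough illustration of the RSS branch, the sketch below parses a small inline feed instead of a live URL, so it needs no network access. The feed contents and the names `sampleFeed` and `texts` are made up for this example; the `import $ivy` line assumes an Ammonite or Almond session like the one used in this notebook.

```scala
// Ammonite/Almond dependency import (assumes the same kernel as this notebook).
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`
import scala.xml.XML

// Hypothetical two-item feed, inlined so the example runs offline.
val sampleFeed =
  """<rss version="2.0">
    |  <channel>
    |    <title>Sample feed</title>
    |    <item><title>First headline</title><description>First summary.</description></item>
    |    <item><title>Second headline</title><description>Second summary.</description></item>
    |  </channel>
    |</rss>""".stripMargin

val xml = XML.loadString(sampleFeed)

// `\\ "item"` selects every <item> at any depth; `\ "title"` selects direct children.
val texts: Seq[String] = (xml \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}
```

The channel-level `<title>` is not part of any `<item>`, so it does not appear in `texts`.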

In [132]:
val url1 = "https://www.reddit.com/r/Android/hot/.json?count=10"
val url2 = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"

url1: String = "https://www.reddit.com/r/Android/hot/.json?count=10"
url2: String = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"

In [133]:
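abstract class URL_R {

    // Perform the HTTP request and return the response body.
    // Any failure (timeout, unreachable host, ...) yields an empty string.
    def get_body(url: String): String = {
        try {
            Http(url).timeout(connTimeoutMs = 2000, readTimeoutMs = 5000).asString.body
        }
        catch {
            case _: Exception => ""
        }
    }

    // Turn the content found at `url` into a list of texts; implemented by each subclass.
    def parser(url: String): Seq[String]
}

defined class URL_R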
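
`get_body` maps every failure to an empty string, so a caller cannot tell a failed request from an empty response. One possible alternative, shown here only as a sketch, is to wrap the call in `scala.util.Try`. The `fetch` function is a hypothetical stand-in for the HTTP request, and `getBody` and `describe` are illustrative names that are not part of the notebook.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the HTTP request, so the sketch runs without network access.
def fetch(url: String): String =
  if (url.startsWith("https://")) "<rss/>"
  else throw new java.io.IOException(s"cannot reach $url")

// Wrapping the call in Try keeps the failure instead of collapsing it to "".
def getBody(url: String): Try[String] = Try(fetch(url))

def describe(result: Try[String]): String = result match {
  case Success(body) => s"fetched ${body.length} characters"
  case Failure(e)    => s"request failed: ${e.getMessage}"
}

val ok = getBody("https://example.com/feed")
val failed = getBody("ftp://example.com/feed")
```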

In [134]:
class RSS_Parse extends URL_R {
    // Parse an RSS feed
    def parser(url: String): Seq[String] = {
        val response = get_body(url)
        response match {
            case "" => Seq()
            case _ =>
                val xml = XML.loadString(response)
                // Extract the text of the title and description of every item
                (xml \\ "item").map { item => (item \ "title").text + " " + (item \ "description").text }
        }
    }
}

defined class RSS_Parse

In [135]:
class REDDIT_Parse extends URL_R {
    // Parse a Reddit listing
    def parser(url: String): Seq[String] = {
        val response = get_body(url)
        response match {
            case "" => Seq()
            case _ =>
                // Parse the JSON and keep the `data` object of every post
                val result = (parse(response) \ "data" \ "children" \ "data")
                    .extract[List[Map[String, Any]]]
                // Keep only the values of the `title` and `selftext` fields
                val filterContent = result.flatten
                    .filter { case (key, _) => key == "title" || key == "selftext" }
                    .map { case (_, value) => value.toString }
                // Strip URLs from the text
                val pattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r
                filterContent.map(x => pattern.replaceAllIn(x, ""))
        }
    }
}

defined class REDDIT_Parse
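
To see what the URL-stripping step in `REDDIT_Parse.parser` does, the sketch below applies the same pattern to a made-up post text. The `sample` string and the names `urlPattern` and `cleaned` are assumptions for this example only.

```scala
// Same URL pattern as in REDDIT_Parse.parser.
val urlPattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r

// Hypothetical post text with a Markdown link and a bare URL.
val sample = "Join our IRC! [See our wiki](https://www.reddit.com/r/Android/wiki) or visit http://example.com/faq."

// Every match is replaced by the empty string.
val cleaned = urlPattern.replaceAllIn(sample, "")
// cleaned == "Join our IRC! [See our wiki]() or visit ."
```

Only the URL itself is removed, which is why Markdown links survive as `[text]()` in the Reddit output.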

In [136]:
val rss = new RSS_Parse()
val rssT = rss.parser(url2)

rss: RSS_Parse = ammonite.$sess.cmd133$Helper$RSS_Parse@5893e6ce
rssT: Seq[String] = List(
  "4 things we heard from Chicago Bears defensive coaches, including the competition at cornerback and Eddie Goldman\u2019s return from an opt-out year New Chicago Bears defensive coordinator Sean Desai and his revamped coaching staff will get their first chance to work with players on the field this weekend at rookie minicamp.",
  "After the Chicago Cubs won a team Gold Glove Award last season, a look at 4 reasons why their defense still is working to get back to elite form As the Chicago Cubs try to return to .500 this weekend in Detroit, it\u2019s worth delving into a few factors manager David Ross identified that he believes have affected the team\u2019s defensive performance.",
  "Prospect Henrik Borgstr\u00f6m already has personal connections to the Chicago Blackhawks \u2014 but the ones he\u2019s building 

In [137]:
val reddit = new REDDIT_Parse()
val redditT = reddit.parser(url1)

reddit: REDDIT_Parse = ammonite.$sess.cmd134$Helper$REDDIT_Parse@48123f5
redditT: Seq[String] = List(
  """**Credits to the team at /r/PickAnAndroidForMe for compiling this information:**

* Home - 

* Smartphones 101 - 

* Top Phones - 


***        
Note 1. Join us at /r/MoronicMondayAndroid, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! 

Note 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.]()""",
  "What should I buy Thursday (May 13 2021) - Your weekly device inquiry thread!",
  """Hey /r/android, we had conducted a [feedback poll]() last year. We have decided to make a few revisions (mainly a preferential voting system) since some of the results were rather open-ended and did not present a clear solution. Participation on the last poll was also lower than anticipated since we only got about [930 r

## 2. Detect the named entities

### 2.1 Create the class that applies the model and counts and sorts the entities

The methods of this class are:

* `getNEs(textList: Seq[String]): Seq[Seq[String]]` receives a list of texts and applies the model to each one: the text is split into words on spaces, and a word is considered a named entity if it starts with an uppercase letter and is not a stop word.
* `count(textL: Seq[Seq[String]]): Map[String, Int]` receives the result of applying the model and returns each entity together with the number of times it is mentioned.
* `sort(counts: Map[String, Int]): List[(String, Int)]` sorts the counted entities by descending frequency.
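
The capitalisation heuristic behind `getNEs` can be tried on a single sentence. The sketch below is a simplified version and not the notebook's class: the stop-word list is cut down to five words, the punctuation set is reduced, and the names `stopwords`, `candidateEntities` and `found` are assumptions for this example.

```scala
// Reduced stop-word list for illustration; NERModel uses a much longer one.
val stopwords = Set("the", "in", "a", "and", "on")

// A token is a candidate entity when it is longer than one character,
// starts with an uppercase letter, and is not a stop word.
def candidateEntities(text: String): Seq[String] =
  text.replaceAll("[.,()!?;:]", "")
    .split(" ")
    .filter(w => w.length > 1 && Character.isUpperCase(w.charAt(0)) && !stopwords.contains(w.toLowerCase))
    .toSeq

val found = candidateEntities("The Chicago Cubs play in Detroit on Friday.")
// found == Seq("Chicago", "Cubs", "Detroit", "Friday")
```

"The" is dropped as a stop word, while "Friday" is kept even though it is not an entity of interest; the same limitation explains entries such as "Thursday" in the notebook's results.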

In [72]:
class NERModel() {
    // Values needed to build the model
    private val STOPWORDS = Set(
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
        "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
        "their", "theirs", "themselves", "what", "which", "who", "whom",
        "this", "that", "these", "those", "am", "is", "are", "was", "were",
        "be", "been", "being", "have", "has", "had", "having", "do", "does",
        "did", "doing", "a", "an", "the", "and", "but", "if", "or",
        "because", "as", "until", "while", "of", "at", "by", "for", "with",
        "about", "against", "between", "into", "through", "during", "before",
        "after", "above", "below", "to", "from", "up", "down", "in", "out",
        "off", "over", "under", "again", "further", "then", "once", "here",
        "there", "when", "where", "why", "how", "all", "any", "both", "each",
        "few", "more", "most", "other", "some", "such", "no", "nor", "not",
        "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
        "will", "just", "don", "should", "now", "on",
        // Contractions without ' (all lowercase, since words are lowercased before the lookup)
        "im", "ive", "id", "youre", "youd", "youve",
        "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
        "theyre", "theyd", "theyve",
        "shouldnt", "couldnt", "mustnt", "cant", "wont",
        // Words that are commonly capitalised
        "hi", "hello"
    )
    private val punctuationSymbols = ".,()!?;:'`´\n"
    // Regex alternation with every symbol escaped: \.|\,|\(|...
    private val punctuationRegex = "\\" + punctuationSymbols.split("").mkString("|\\")

    // Apply the model to a single text: strip punctuation, split on spaces and keep
    // the capitalised words longer than one character that are not stop words
    private def getNEsSingle(text: String): Seq[String] =
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word: String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq

    // Apply the model to the data (simply maps the function over the list of texts)
    def getNEs(textList: Seq[String]): Seq[Seq[String]] = textList.map(getNEsSingle)

    // Count the entities
    def count(textL: Seq[Seq[String]]): Map[String, Int] = {
        textL.flatten.foldLeft(Map.empty[String, Int]) {
            (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
        }
    }

    // Sort the entities by descending frequency
    def sort(counts: Map[String, Int]): List[(String, Int)] = {
        counts.toList
          .sortBy(_._2)(Ordering[Int].reverse)
    }
}

defined class NERModel

In [14]:
val model = new NERModel

model: NERModel = ammonite.$sess.cmd12$Helper$NERModel@2e0a535e

### 2.2 Apply the "model" to the data

In [15]:
val result_rss = model.getNEs(rssT)
val result_reddit = model.getNEs(redditT)

result_rss: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Chicago",
    "White",
    "Sox",
    "Minnesota",
    "Twins",
    "Chicago",
    "White",
    "Sox",
    "Minnesota",
    "Twins",
    "Thursday",
    "Guaranteed",
    "Rate",
    "Field",
    "Lance",
    "Lynn"
  ),
  ArrayBuffer(
    "Photos",
    "Chicago",
    "White",
    "Sox",
    "Minnesota",
    "Twins",
    "Photos",
    "Chicago",
    "White",
    "Soxs",
    "Minnesota",
    "Twins",
    "May",
    "Guaranteed",
    "Rate",
    "Field",
    "Sox"
  ),
  ArrayBuffer("Gleyber", "Torres", "

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
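
The three steps can be sketched on a tiny hand-written input. This version uses `groupBy` rather than the `foldLeft` in `NERModel.count`; both should give the same counts. The input data and the names `entitiesPerText`, `counts` and `ranked` are made up for this example.

```scala
// Hypothetical per-text entity lists, shaped like the output of getNEs.
val entitiesPerText: Seq[Seq[String]] = Seq(
  Seq("Chicago", "Cubs", "Chicago"),
  Seq("Chicago", "Bears", "Cubs")
)

// 1. Concatenate the lists, 2. count each entity.
val counts: Map[String, Int] =
  entitiesPerText.flatten.groupBy(identity).map { case (entity, hits) => entity -> hits.size }

// 3. Sort by descending count.
val ranked: List[(String, Int)] = counts.toList.sortBy { case (_, n) => -n }
// ranked == List(("Chicago", 3), ("Cubs", 2), ("Bears", 1))
```

Entities with equal counts keep whatever order the map yields, so their relative position in the ranking is not guaranteed.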

In [16]:
val count_rss = model.count(result_rss)
val count_reddit = model.count(result_reddit)

count_rss: Map[String, Int] = Map(
  "Rate" -> 6,
  "Nets" -> 3,
  "Gleyber" -> 2,
  "Corey" -> 2,
  "Polisky" -> 6,
  "Operations" -> 1,
  "Parker" -> 2,
  "Crawford" -> 2,
  "Bieber" -> 2,
  "OC" -> 1,
  "Yankee" -> 1,
  "Sky" -> 2,
  "Washington" -> 1,
  "Robert" -> 1,
  "Dach\u2019s" -> 1,
  "Hockey" -> 1,
  "DeBrincat\u2019s" -> 1,
  "President" -> 1,
  "Alex" -> 1,
  "Abreu\u2019s" -> 1,
  "Shane" -> 2,
  "Finnish" -> 1,
  "COVID" -> 2,
  "Henrik" -> 2,

In [17]:
val sort_rss = model.sort(count_rss)
val sort_reddit = model.sort(count_reddit)

sort_rss: List[(String, Int)] = List(
  ("Chicago", 46),
  ("White", 18),
  ("Sox", 18),
  ("Minnesota", 12),
  ("Twins", 12),
  ("Blackhawks", 11),
  ("Photos", 8),
  ("Cubs", 7),
  ("Bears", 7),
  ("Rate", 6),
  ("Polisky", 6),
  ("Mike", 6),
  ("Guaranteed", 6),
  ("Field", 6),
  ("NFL", 6),
  ("Bulls", 6),
  ("Patrick", 5),
  ("May", 5),
  ("Northwestern", 5),
  ("Tuesday", 4),
  ("COVID-19", 4),
  ("Fields", 4),
  ("Soxs", 4),
  ("Justin", 4),
  ("Clevelan

## 4. FeedService


In [83]:
class FeedService() {
    // Buffer that stores the subscriptions as (URL, parser) pairs
    private val buffer = new ListBuffer[(String, URL_R)]()
    // Placeholder that is replaced by each parameter
    private val word = "%s".r

    // Record the subscribed URLs, optionally expanded with their parameters
    def suscribe(url: String, param: List[String], t: URL_R): Unit = {
        param match {
            case Nil =>
                buffer.append((url, t))
            case params =>
                val url_t = params.map(x => word.replaceFirstIn(url, x)).map(x => (x, t))
                buffer ++= url_t
        }
    }

    // Fetch the feeds
    def get_feed(): Seq[Seq[String]] = {
        buffer.toSeq.map { case (url, feedParser) => feedParser.parser(url) }
    }

    // Merge the result of every feed into a single list
    def get_result(feed: Seq[Seq[String]]): Seq[String] = {
        feed.flatten
    }
}

defined class FeedService
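
The parameter expansion in `suscribe` can be looked at in isolation: each parameter replaces the first `%s` in the URL template. The sketch below reuses the Reddit template from this notebook; the parameter `"scala"` and the names `placeholder`, `template` and `urls` are assumptions for this example. Wrapping the parameter in `Regex.quoteReplacement` is an addition that is not in the original class; it guards against parameters containing `$` or `\`, which `replaceFirstIn` would otherwise interpret as group references or escapes.

```scala
import scala.util.matching.Regex

val placeholder = "%s".r
val template = "https://www.reddit.com/r/%s/hot/.json?count=10"

// One concrete URL per parameter, mirroring what FeedService.suscribe stores.
val urls = List("Android", "scala").map { p =>
  placeholder.replaceFirstIn(template, Regex.quoteReplacement(p))
}
```

With an empty parameter list, `suscribe` stores the URL unchanged, which is how a fixed URL such as `url2` is subscribed.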

In [84]:
val feed_service = new FeedService()
val url3 = "https://rss.nytimes.com/services/xml/rss/nyt/%s.xml"
val url4 = "https://www.chicagotribune.com/arcio/rss/category/%s/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
val url5 = "https://www.reddit.com/r/%s/hot/.json?count=10"

feed_service: FeedService = ammonite.$sess.cmd82$Helper$FeedService@7b225695
url3: String = "https://rss.nytimes.com/services/xml/rss/nyt/%s.xml"
url4: String = "https://www.chicagotribune.com/arcio/rss/category/%s/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
url5: String = "https://www.reddit.com/r/%s/hot/.json?count=10"

In [85]:
feed_service.suscribe(url3, List[String]("Business","Technology"), new RSS_Parse())
feed_service.suscribe(url4, List[String]("sports","business"), new RSS_Parse())
feed_service.suscribe(url5, List[String]("Android"), new REDDIT_Parse())
feed_service.suscribe(url2, List[String](), new RSS_Parse())

In [86]:
val get_feed = feed_service.get_feed

get_feed: Seq[Seq[String]] = List(
  List(
    "Looking for Bipartisan Accord? Just Ask About Big Business. In surveys and political discourse, Republicans are increasingly critical of corporations, but not for the reasons Democrats have long held that view.Delta is among the corporations that have drawn fire from Republicans for taking stands against Georgia\u2019s new voting law after protests.",
    "He\u2019s a Dogecoin Millionaire. And He\u2019s Not Selling. Glauber Contessoto went looking for something that could change his fortunes overnight. He found it in a joke cryptocurrency.Glauber Contessoto believes deeply in Dogecoin.",
    "Retail Sales Were Flat in April Retail sales held steady in April after rising 10.7 percent the previous month, as Americans continued to spend government stimulus payments.Shoppers in New York this week. Retail sales were flat in April from the prior month.",

In [88]:
val get_result = feed_service.get_result(get_feed)

get_result: Seq[String] = List(
  "Looking for Bipartisan Accord? Just Ask About Big Business. In surveys and political discourse, Republicans are increasingly critical of corporations, but not for the reasons Democrats have long held that view.Delta is among the corporations that have drawn fire from Republicans for taking stands against Georgia\u2019s new voting law after protests.",
  "He\u2019s a Dogecoin Millionaire. And He\u2019s Not Selling. Glauber Contessoto went looking for something that could change his fortunes overnight. He found it in a joke cryptocurrency.Glauber Contessoto believes deeply in Dogecoin.",
  "Retail Sales Were Flat in April Retail sales held steady in April after rising 10.7 percent the previous month, as Americans continued to spend government stimulus payments.Shoppers in New York this week. Retail sales were flat in April from the prior month.",
  "How to Navigate a Hot Housing 

In [89]:
val model_N = new NERModel

model_N: NERModel = ammonite.$sess.cmd71$Helper$NERModel@10b03898

In [90]:
val do_model = model_N.getNEs(get_result)
val do_count = model_N.count(do_model)
val do_sort = model_N.sort(do_count)

do_model: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Looking",
    "Bipartisan",
    "Accord",
    "Ask",
    "Big",
    "Business",
    "Republicans",
    "Democrats",
    "Republicans",
    "Georgia\u2019s"
  ),
  ArrayBuffer(
    "He\u2019s",
    "Dogecoin",
    "Millionaire",
    "He\u2019s",
    "Selling",
    "Glauber",
    "Contessoto",
    "Contessoto",
    "Dogecoin"
  ),
  ArrayBuffer(
    "Retail",
    "Sales",
    "Flat",
    "April",
    "Retail",
    "April",
    "Americans",
    "New",
    "York",
    "Retail",
    "April"
  ),
  ArrayBuffer("Navigate", "Hot", "