# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects the named entities (NEs) with a simple heuristic model, and displays them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, see Help -> About Scala Kernel.

## 1. Fetch the text

### 1.1 Import libraries

In [106]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`

In [107]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML

In [108]:
import $ivy.`org.json4s::json4s-jackson:3.4.0`
import org.json4s.JsonDSL._
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats

formats: DefaultFormats.type = org.json4s.DefaultFormats$@1065e3fa

In [109]:
import scala.collection.mutable.ListBuffer

### 1.2 Fetch the text from the RSS feed and the Reddit feed

We create a class that takes one argument, the feed type, which determines how the response is parsed.
It has a single method, `queryURL(url: String): Seq[String]`, which takes a URL and returns a list of texts holding the data found at that URL.

For RSS:

We make an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the text of the feed in XML format. The XML is then parsed to extract the `title` and `description` fields of each `item`.
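
The extraction step can be tried without any network access. The block below is a minimal sketch, not the notebook's code: it uses the JDK's built-in DOM parser instead of scala-xml so that it runs with no dependencies, and `sampleRss` and `itemTexts` are hypothetical names with made-up content.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// A tiny, made-up RSS document used only for illustration.
val sampleRss =
  """<rss version="2.0"><channel>
    |  <item><title>First headline</title><description>First summary.</description></item>
    |  <item><title>Second headline</title><description>Second summary.</description></item>
    |</channel></rss>""".stripMargin

// Returns "title description" for every <item> element in the document.
// Assumption: getElementsByTagName looks at all descendants, whereas the
// notebook's `item \ "title"` only looks at direct children.
def itemTexts(xmlText: String): Seq[String] = {
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  val doc = builder.parse(new ByteArrayInputStream(xmlText.getBytes(StandardCharsets.UTF_8)))
  val items = doc.getElementsByTagName("item")
  (0 until items.getLength).map { i =>
    val item = items.item(i).asInstanceOf[Element]
    def childText(tag: String): String = {
      val nodes = item.getElementsByTagName(tag)
      if (nodes.getLength > 0) nodes.item(0).getTextContent else ""
    }
    childText("title") + " " + childText("description")
  }
}

val texts = itemTexts(sampleRss)
```

Each element of `texts` joins one item's title and description with a space, which is the same shape `queryURL` returns for an RSS feed.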

For Reddit:

We make an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the text of the feed in JSON format. The JSON is then parsed to extract the `title` and `selftext` fields, and URLs are removed from the resulting text.
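
The URL-removal step for Reddit text can be illustrated in isolation. This is a small sketch that reuses the regular expression from the `GetTextURL` class; `urlPattern`, `sample`, and `cleaned` are hypothetical names, and the sample sentence is made up.

```scala
// Same pattern as in the notebook's GetTextURL class.
val urlPattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r

// Made-up Reddit-style markdown containing a link.
val sample = "Read [our wiki](https://example.com/wiki) for instructions."

// Every URL is replaced by the empty string; the markdown brackets stay.
val cleaned = urlPattern.replaceAllIn(sample, "")
```

Because only the URL itself is removed, markdown links are left as empty parentheses, which is why the Reddit output contains fragments such as `[feedback poll]()`.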

In [110]:
val url1 = "https://www.reddit.com/r/Android/hot/.json?count=10"
val url2 = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"

url1: String = "https://www.reddit.com/r/Android/hot/.json?count=10"
url2: String = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"

In [111]:
class GetTextURL(urlType: String){
    // Fetch the text of a feed from a URL; returns an empty list on any failure
    def queryURL(url: String): Seq[String] = {
        try{
            val response: HttpResponse[String] = Http(url)
              .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
              .asString
            val stringBody = response.body
            urlType match {
                case "rss" => {
                    val xml = XML.loadString(stringBody)
                    // Extract text from title and description
                    (xml \\ "item").map { item => ((item \ "title").text + " " + (item \ "description").text) }
                }
                case "reddit" => {
                    // Parse the Reddit feed as JSON
                    val result = (parse(stringBody) \ "data" \ "children" \ "data")
                         .extract[List[Map[String, Any]]]
                    // Keep only the title and selftext fields
                    val filterContent = result.flatten.filter{case (v , _) => v == "title" || v == "selftext" }.map(x => x._2.toString)
                    // Remove URLs from the text
                    val pattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r
                    filterContent.map(x => pattern.replaceAllIn(x,"")).toSeq
                }
            }
        }catch{
            // Network errors, parse errors, and an unknown urlType (MatchError) all end up here
            case e: Exception => List()
        }    
    }
}

defined class GetTextURL

In [114]:
val rss = new GetTextURL("rss")
val rssT = rss.queryURL(url2)

rss: GetTextURL = ammonite.$sess.cmd110$Helper$GetTextURL@5c4f583b
rssT: Seq[String] = List(
  "Andrew Vaughn\u2019s 1st major-league home run helps power the Chicago White Sox to their 5th straight win: \u2018Pretty special moment. It came at a good time, too.\u2019 Andrew Vaughn hit the first home run of his major-league career in the Chicago White Sox's 13-8 victory against the Minnesota Twins at Guaranteed Rate Field. The Sox won their fifth straight, and it's their fourth consecutive game scoring at least nine runs.",
  "The highs and lows of the 2020-21 Chicago Blackhawks: From Jonathan Toews\u2019 season-long absence to Patrick Kane\u2019s milestone and Kirby Dach\u2019s injury and return The Chicago Blackhawks' 2020-21 season started with offseason news about Kirby Dach but held a few other surprises. Here's a timeline of their year.",
  "The Chicago Bears\u2019 2021 schedule is out. Here are o

In [115]:
val reddit = new GetTextURL("reddit")
val redditT = reddit.queryURL(url1)

reddit: GetTextURL = ammonite.$sess.cmd110$Helper$GetTextURL@38676b12
redditT: Seq[String] = List(
  """**Credits to the team at /r/PickAnAndroidForMe for compiling this information:**

* Home - 

* Smartphones 101 - 

* Top Phones - 


***        
Note 1. Join us at /r/MoronicMondayAndroid, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! 

Note 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.]()""",
  "What should I buy Thursday (May 13 2021) - Your weekly device inquiry thread!",
  """Hey /r/android, we had conducted a [feedback poll]() last year. We have decided to make a few revisions (mainly a preferential voting system) since some of the results were rather open-ended and did not present a clear solution. Participation on the last poll was also lower than anticipated since we only got about [930 resp

## 2. Detect the named entities

### 2.1 Create the class that applies the model, counts, and sorts the entities

The methods of this class are:

* `getNEs(textList: Seq[String]): Seq[Seq[String]]` takes a list of texts and applies the model to each one: it strips punctuation, splits the text into words on spaces, and treats a word as a named entity if it is longer than one character, starts with an uppercase letter, and is not a stopword.

* `count(textL: Seq[Seq[String]]): Map[String, Int]` takes the result of applying the model and returns each entity together with the number of times it appears.

* `sort(counts: Map[String, Int]): List[(String, Int)]` sorts the counted entities by descending frequency.
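
The capitalization heuristic behind `getNEs` can be tried on a single sentence. The block below is a simplified sketch under stated assumptions: `stopwords` is a deliberately short, made-up list, the punctuation set is reduced to a few ASCII symbols, and `candidateEntities` is a hypothetical name rather than a method of the class.

```scala
// A short stopword set for illustration; the notebook uses a much longer list.
val stopwords = Set("the", "a", "it", "this", "hi", "hello")

// Strip a few punctuation marks, split on spaces, and keep words that are
// longer than one character, start with an uppercase letter, and are not stopwords.
def candidateEntities(text: String): Seq[String] =
  text.replaceAll("[.,()!?;:']", "")
    .split(" ")
    .filter { w =>
      w.length > 1 &&
      Character.isUpperCase(w.charAt(0)) &&
      !stopwords.contains(w.toLowerCase)
    }
    .toSeq

val found = candidateEntities("The Chicago White Sox beat the Minnesota Twins, 13-8.")
```

Here `found` keeps `Chicago`, `White`, `Sox`, `Minnesota`, and `Twins`: the sentence-initial `The` is dropped by the stopword check, and `13-8` is dropped because it does not start with an uppercase letter. Multi-word names are not grouped, so `White Sox` yields two separate entities.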

In [116]:
class NERModel() {
    // Values needed to build the model (a Set gives constant-time lookups)
    private val STOPWORDS = Set (
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
        "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
        "their", "theirs", "themselves", "what", "which", "who", "whom",
        "this", "that", "these", "those", "am", "is", "are", "was", "were",
        "be", "been", "being", "have", "has", "had", "having", "do", "does",
        "did", "doing", "a", "an", "the", "and", "but", "if", "or",
        "because", "as", "until", "while", "of", "at", "by", "for", "with",
        "about", "against", "between", "into", "through", "during", "before",
        "after", "above", "below", "to", "from", "up", "down", "in", "out",
        "off", "over", "under", "again", "further", "then", "once", "here",
        "there", "when", "where", "why", "how", "all", "any", "both", "each",
        "few", "more", "most", "other", "some", "such", "no", "nor", "not",
        "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
        "will", "just", "don", "should", "now", "on",
        // Contractions without '
        "im", "ive", "id", "youre", "youd", "youve",
        "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
        "theyre", "theyd", "theyve",
        "shouldnt", "couldnt", "mustnt", "cant", "wont",
        // Words that are commonly capitalized
        "hi", "hello"
    )
    private val punctuationSymbols = ".,()!?;:'`´\n"
    // Character class matching any one punctuation symbol (none of them is special inside [...])
    private val punctuationRegex = "[" + punctuationSymbols + "]"
    
    // Apply the model to the data (this simply maps the single-text function over the list of texts)
    private def getNEsSingle(text: String): Seq[String] =
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word:String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq

    def getNEs(textList: Seq[String]): Seq[Seq[String]] = textList.map(getNEsSingle)
    
    // Count the entities
    def count(textL : Seq[Seq[String]]): Map[String,Int] = {
        textL.flatten.foldLeft(Map.empty[String, Int]) {
            (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
    }
    // Sort the entities by descending count
    def sort(counts: Map[String,Int]): List[(String, Int)] = {
        counts.toList
          .sortBy(_._2)(Ordering[Int].reverse)
    }
}
       

defined class NERModel

In [117]:
val model = new NERModel

model: NERModel = ammonite.$sess.cmd115$Helper$NERModel@2ad8afa0

### 2.2 Apply the "model" to the data

In [118]:
val result_rss = model.getNEs(rssT)
val result_reddit = model.getNEs(redditT)

result_rss: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Andrew",
    "Vaughn\u2019s",
    "Chicago",
    "White",
    "Sox",
    "Andrew",
    "Vaughn",
    "Chicago",
    "White",
    "Soxs",
    "Minnesota",
    "Twins",
    "Guaranteed",
    "Rate",
    "Field",
    "Sox"
  ),
  ArrayBuffer(
    "Chicago",
    "Blackhawks",
    "Jonathan",
    "Toews\u2019",
    "Patrick",
    "Kane\u2019s",
    "Kirby",
    "Dach\u2019s",
    "Chicago",
    "Blackhawks",
    "Kirby",
    "Dach",
    "Heres"
  ),
  ArrayBuffer(
    "Chicago",
    "Bears\u2019",
    "Chicago",
    "Bears",
...

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
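
The three steps can be sketched on a small input. This is an illustrative variant rather than the notebook's implementation: it counts with `groupBy` instead of `foldLeft`, and it breaks ties alphabetically, which is an assumption (the notebook's `sort` leaves the order of ties unspecified). The names `perText`, `counts`, and `ranked` and the sample data are made up.

```scala
// Made-up per-text entity lists, in the shape returned by getNEs.
val perText: Seq[Seq[String]] = Seq(
  Seq("Chicago", "Bears", "Chicago"),
  Seq("Chicago", "Bulls"),
  Seq("Bears")
)

// Concatenate all lists, then count the occurrences of each entity.
val counts: Map[String, Int] =
  perText.flatten
    .groupBy(identity)
    .map { case (entity, occurrences) => entity -> occurrences.size }

// Sort by descending count; ties are broken alphabetically (an assumption).
val ranked: List[(String, Int)] =
  counts.toList.sortBy { case (entity, n) => (-n, entity) }
```

With this input, `ranked` lists `Chicago` with 3 occurrences, then `Bears` with 2, then `Bulls` with 1.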

In [119]:
val count_rss = model.count(result_rss)
val count_reddit = model.count(result_reddit)

count_rss: Map[String, Int] = Map(
  "Rate" -> 4,
  "University" -> 1,
  "Nets" -> 3,
  "Gleyber" -> 1,
  "Spirit\u2019s" -> 1,
  "Corey" -> 2,
  "Marisnick" -> 1,
  "Polisky" -> 4,
  "Operations" -> 1,
  "Cub" -> 1,
  "Parker" -> 4,
  "Crawford" -> 2,
  "Bieber" -> 2,
  "Sky" -> 3,
  "Hawaii" -> 2,
  "Washington" -> 1,
  "IL" -> 1,
  "Robert" -> 1,
  "Dach\u2019s" -> 1,
  "Hockey" -> 1,
  "DeBrincat\u2019s" -> 1,
  "Marisnick\u2019s" -> 2,
  "President" -> 1,
  "Alex" ->

In [120]:
val sort_rss = model.sort(count_rss)
val sort_reddit = model.sort(count_reddit)

sort_rss: List[(String, Int)] = List(
  ("Chicago", 49),
  ("White", 12),
  ("Sox", 12),
  ("Blackhawks", 11),
  ("Minnesota", 8),
  ("Twins", 8),
  ("NFL", 8),
  ("Cubs", 7),
  ("Bears", 7),
  ("Tuesday", 6),
  ("Bulls", 6),
  ("Photos", 6),
  ("Wednesday", 5),
  ("Patrick", 5),
  ("Rate", 4),
  ("Polisky", 4),
  ("Parker", 4),
  ("Mike", 4),
  ("Guaranteed", 4),
  ("Candace", 4),
  ("Field", 4),
  ("Cleveland", 4),
  ("Indians", 4),
  ("Nets", 3),
  ("Sky"

## 4. FeedService


In [121]:
class FeedService(){
    // Holds the subscriptions
    private val buffer = new ListBuffer[(String,GetTextURL)]()
    
    // Keep a record of each subscribed URL together with the GetTextURL that carries its feed type
    def suscribe(url: String, t: GetTextURL) : Unit= {
        buffer += ((url,t))
    }
    
    // Fetch every subscribed feed, one list of texts per subscription
    def get_feed(): Seq[Seq[String]] = {
        buffer.toList.map { case (url, t) => t.queryURL(url) }
    }
    
    // Merge the result of each feed into a single list
    def get_result(feed: Seq[Seq[String]]): Seq[String] = {
        feed.flatten
    }
    
}

defined class FeedService
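
The subscription pattern can be exercised without any HTTP calls. The block below is a hedged sketch, not the class above: `SimpleFeedService`, `subscribe`, `fetchAll`, and `fetchMerged` are hypothetical names, and plain functions from URL to texts stand in for `GetTextURL` so that the example is self-contained.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for FeedService: a fetcher is any function from a URL to texts.
class SimpleFeedService {
  private val subscriptions = ListBuffer.empty[(String, String => Seq[String])]

  // Remember the URL together with the function that knows how to read it.
  def subscribe(url: String, fetcher: String => Seq[String]): Unit = {
    subscriptions += ((url, fetcher))
  }

  // One list of texts per subscription, in subscription order.
  def fetchAll(): Seq[Seq[String]] =
    subscriptions.toList.map { case (url, fetcher) => fetcher(url) }

  // All texts merged into a single list.
  def fetchMerged(): Seq[String] = fetchAll().flatten
}

val service = new SimpleFeedService
service.subscribe("https://example.com/a", url => Seq("A1 from " + url, "A2"))
service.subscribe("https://example.com/b", _ => Seq("B1"))
val merged = service.fetchMerged()
```

Subscribing the same URL more than once fetches it more than once, so its texts (and any entity counts derived from them) are repeated in the merged list.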

In [122]:
val feed_service = new FeedService()

feed_service: FeedService = ammonite.$sess.cmd120$Helper$FeedService@1fdc8c5e

In [123]:
feed_service.suscribe(url1, new GetTextURL("reddit"))
feed_service.suscribe(url1, new GetTextURL("reddit"))
feed_service.suscribe(url2, new GetTextURL("rss"))
feed_service.suscribe(url1, new GetTextURL("reddit"))

In [124]:
val get_feed = feed_service.get_feed

get_feed: Seq[Seq[String]] = List(
  List(
    """**Credits to the team at /r/PickAnAndroidForMe for compiling this information:**

* Home - 

* Smartphones 101 - 

* Top Phones - 


***        
Note 1. Join us at /r/MoronicMondayAndroid, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! 

Note 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.]()""",
    "What should I buy Thursday (May 13 2021) - Your weekly device inquiry thread!",
    """Hey /r/android, we had conducted a [feedback poll]() last year. We have decided to make a few revisions (mainly a preferential voting system) since some of the results were rather open-ended and did not present a clear solution. Participation on the last poll was also lower than anticipated since we only got about [930 responses]().

NOTES

* The poll was created via Googl

In [125]:
val get_result = feed_service.get_result(get_feed)

get_result: Seq[String] = List(
  """**Credits to the team at /r/PickAnAndroidForMe for compiling this information:**

* Home - 

* Smartphones 101 - 

* Top Phones - 


***        
Note 1. Join us at /r/MoronicMondayAndroid, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! 

Note 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.]()""",
  "What should I buy Thursday (May 13 2021) - Your weekly device inquiry thread!",
  """Hey /r/android, we had conducted a [feedback poll]() last year. We have decided to make a few revisions (mainly a preferential voting system) since some of the results were rather open-ended and did not present a clear solution. Participation on the last poll was also lower than anticipated since we only got about [930 responses]().

NOTES

* The poll was created via Google Forms and requires sign-in to preve

In [126]:
val model_N = new NERModel

model_N: NERModel = ammonite.$sess.cmd115$Helper$NERModel@f270a92

In [127]:
val do_model = model_N.getNEs(get_result)
val do_count = model_N.count(do_model)
val do_sort = model_N.sort(do_count)

do_model: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Home",
    "Smartphones",
    "Top",
    "Phones",
    "Note",
    "Join",
    "Ctrl-F",
    "Note",
    "Join",
    "IRC",
    "Telegram"
  ),
  ArrayBuffer("Thursday", "May"),
  ArrayBuffer(
    "Hey",
    "Participation",
    "Google",
    "Forms",
    "Email",
    "Responses",
    "Well",
    "POLL]Edit"
  ),
  ArrayBuffer("Community", "Feedback", "Poll", "February"),
  ArrayBuffer(),
  ArrayBuffer("Google", "Messages"),
  ArrayBuffer(),
  ArrayBuffer(),
  ArrayBuffer(),
  ArrayBuffer("US", "Agrees"