# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects named entities (NEs) with a simple capitalization-based heuristic, and shows them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, see Help -> About Scala Kernel.

## 1. Fetch the text

### 1.1 Import libraries

In [2]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`

In [3]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML

### 1.2 Fetch the text from the RSS feed

We make an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the feed text as XML. The XML is then parsed to extract the `title` and `description` fields of each item.

In [28]:
class RSS() {
    // Fetch the feed at `url` and return one string per item (title + description)
    def getRSSText(url: String): Seq[String] = {
        try {
            val response: HttpResponse[String] = Http(url)
              .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
              .asString
            val xmlString = response.body
            // Convert the String to a scala.xml.Elem
            val xml = XML.loadString(xmlString)
            // Extract the text of the title and description of each item
            (xml \\ "item").map { item => (item \ "title").text + " " + (item \ "description").text }
        }
        catch {
            // Any failure (network error, timeout, malformed XML) yields an empty result
            case e: Exception => Seq()
        }
    }
}

defined class RSS
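
The parsing step can be tried without any network access. The following is a minimal sketch, assuming `scala-xml` is on the classpath as in section 1.1; the feed content and the names `sampleFeed`, `sampleXml` and `sampleTexts` are hypothetical and used only for illustration.

```scala
import scala.xml.XML

// Hypothetical two-item feed, inlined so the example needs no network access.
val sampleFeed =
  """<rss version="2.0">
    |  <channel>
    |    <title>Sample feed</title>
    |    <item><title>First headline</title><description>First summary.</description></item>
    |    <item><title>Second headline</title><description>Second summary.</description></item>
    |  </channel>
    |</rss>""".stripMargin

val sampleXml = XML.loadString(sampleFeed)

// `\\` searches all descendants, `\` only direct children,
// so the channel-level <title> is not picked up.
val sampleTexts: Seq[String] = (sampleXml \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}

sampleTexts.foreach(println)
// prints:
// First headline First summary.
// Second headline Second summary.
```

Because `\ "title"` only looks at direct children of each `item`, the title of the channel itself does not leak into the result.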

In [29]:
val rss = new RSS

rss: RSS = ammonite.$sess.cmd27$Helper$RSS@6dc1efe9

In [36]:
val rssText = rss.getRSSText("https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc")

rssText: Seq[String] = List(
  "ESPN announces a contract extension for Chris Berman on his 66th birthday Chris Berman will continue to host \u201cNFL PrimeTime\u201d after agreeing to a new contract with ESPN. The multiyear agreement was announced on Berman\u2019s 66th birthday Monday.",
  "Tim Tebow is expected to reunite with Urban Meyer on the Jacksonville Jaguars \u2014 as a tight end Tim Tebow and Urban Meyer apparently are getting back together, this time in the NFL. The 2007 Heisman Trophy-winning quarterback at Florida is expected to team up with his college coach by signing a one-year contract to play tight end for the Jacksonville Jaguars.",
  "Cubs and White Sox fans are back at ballparks \u2014 with some 2021 adjustments Chicago baseball fans are back at ballparks for the 2021 season \u2014 with some adjustments. Masks are required, tickets are digital and social distancing is built into seating arrangements.

## 2. Detect the named entities

### 2.1 Create the model

The **model** is just the function `getNEs`, which takes a list of texts.
For each text, it splits the text into words on spaces and treats a word as a named entity if it starts with an uppercase letter.

This code lists the punctuation symbols and some common English words that are removed from the text.

In [37]:
class NERModel() {
    // Values needed to build the model
    val STOPWORDS = Seq (
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
        "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
        "their", "theirs", "themselves", "what", "which", "who", "whom",
        "this", "that", "these", "those", "am", "is", "are", "was", "were",
        "be", "been", "being", "have", "has", "had", "having", "do", "does",
        "did", "doing", "a", "an", "the", "and", "but", "if", "or",
        "because", "as", "until", "while", "of", "at", "by", "for", "with",
        "about", "against", "between", "into", "through", "during", "before",
        "after", "above", "below", "to", "from", "up", "down", "in", "out",
        "off", "over", "under", "again", "further", "then", "once", "here",
        "there", "when", "where", "why", "how", "all", "any", "both", "each",
        "few", "more", "most", "other", "some", "such", "no", "nor", "not",
        "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
        "will", "just", "don", "should", "now", "on",
        // Contractions without ' (all lowercase, because words are lowercased before lookup)
        "im", "ive", "id", "youre", "youd", "youve",
        "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
        "theyre", "theyd", "theyve",
        "shouldnt", "couldnt", "mustnt", "cant", "wont",
        // Common capitalized words
        "hi", "hello"
    )
    val punctuationSymbols = ".,()!?;:'`´\n"
    // Alternation of backslash-escaped symbols: \.|\,|\(|\)|...
    val punctuationRegex = "\\" + punctuationSymbols.split("").mkString("|\\")

    // Apply the model to the data (this simply maps the function over the list of texts)
    def getNEsSingle(text: String): Seq[String] =
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word:String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq

    def getNEs(textList: Seq[String]): Seq[Seq[String]] = textList.map(getNEsSingle)

    // Count and sort the entities
    def countandsort(textL : Seq[Seq[String]]): List[(String, Int)] = {
        val counts: Map[String, Int] = textL.flatten
          .foldLeft(Map.empty[String, Int]) {
             (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
        counts.toList
          .sortBy(_._2)(Ordering[Int].reverse)
    }
}

defined class NERModel
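
To see the heuristic in isolation, here is a minimal sketch on a single sentence. The sentence, the reduced stopword set and the names `sampleStopwords`, `sampleSentence` and `sampleEntities` are hypothetical, and the punctuation pattern is a simplified character class rather than the `punctuationRegex` built above.

```scala
// Hypothetical inputs, chosen only to illustrate the heuristic.
val sampleStopwords = Set("the", "in", "on")
val sampleSentence = "The Cubs beat the Pirates in Chicago on Sunday."

val sampleEntities: Seq[String] = sampleSentence
  .replaceAll("[.,!?;:]", "")   // strip a few punctuation marks
  .split(" ")                   // split on single spaces
  .filter { word =>
    word.length > 1 &&
      Character.isUpperCase(word.charAt(0)) &&
      !sampleStopwords.contains(word.toLowerCase)
  }
  .toSeq

println(sampleEntities.mkString(", "))
// prints: Cubs, Pirates, Chicago, Sunday
```

The sentence-initial "The" is dropped only because it is a stopword, while "Sunday" passes the filter even though it is arguably not an entity of interest. That is a known limitation of a capitalization-only rule.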

In [38]:
val model = new NERModel

model: NERModel = ammonite.$sess.cmd36$Helper$NERModel@5fd0d4ff

### 2.2 Apply the "model" to the data

In [39]:
val result = model.getNEs(rssText)

result: Seq[Seq[String]] = List(
  ArrayBuffer(
    "ESPN",
    "Chris",
    "Berman",
    "Chris",
    "Berman",
    "PrimeTime\u201d",
    "ESPN",
    "Berman\u2019s",
    "Monday"
  ),
  ArrayBuffer(
    "Tim",
    "Tebow",
    "Urban",
    "Meyer",
    "Jacksonville",
    "Jaguars",
    "Tim",
    "Tebow",
    "Urban",
    "Meyer",
    "NFL",
    "Heisman",
    "Trophy-winning",
    "Florida",
    "Jacksonville",
    "Jaguars"
  ),
  ArrayBuffer("Cubs", "White", "Sox", "Chicago", "Masks"),
  ArrayBuffer(
    "DQ",
    "Bob",
    "Baffert",
    "Q&A",

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
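
As a small worked example of this step, independent of the feed data, the sketch below counts and sorts a few hypothetical entity lists. It uses `groupBy` instead of the `foldLeft` in `countandsort`, and it breaks ties alphabetically, which the notebook's method does not do; the names `sampleLists` and `sampleCounts` are assumptions made for illustration.

```scala
// Hypothetical per-article entity lists.
val sampleLists: Seq[Seq[String]] = Seq(
  Seq("Chicago", "Cubs"),
  Seq("Chicago", "Sox"),
  Seq("Chicago", "Cubs")
)

val sampleCounts: List[(String, Int)] = sampleLists.flatten
  .groupBy(identity)                                            // word -> all its occurrences
  .map { case (word, occurrences) => word -> occurrences.size } // word -> count
  .toList
  .sortBy { case (word, count) => (-count, word) }              // by count descending, then by word

println(sampleCounts)
// prints: List((Chicago,3), (Cubs,2), (Sox,1))
```
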

In [40]:
val CountAndSortedNEs = model.countandsort(result)

CountAndSortedNEs: List[(String, Int)] = List(
  ("Chicago", 35),
  ("Sox", 10),
  ("Cubs", 10),
  ("White", 9),
  ("Medina", 6),
  ("Pirates", 6),
  ("Photos", 6),
  ("Bob", 5),
  ("Baffert", 5),
  ("Sunday", 5),
  ("Sky", 4),
  ("Derby", 4),
  ("Saturday", 4),
  ("Royals", 4),
  ("Spirit", 4),
  ("Center", 4),
  ("May", 4),
  ("Kentucky", 4),
  ("Field", 4),
  ("City", 4),
  ("Stars", 4),
  ("United", 4),
  ("Kansas", 4),
  ("Pittsburgh", 4),
  ("Wrigley"