# Reading from MySQL

**MySQL**
- Widely used open-source database software
- Common in internet-based applications
- Data is organized into:
    - Databases
    - Tables within databases
    - Fields within tables
- Each row is called a "record"
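
To make the database / table / field / record hierarchy concrete, here is a minimal sketch that mimics it with plain R objects. The table name `genes` and its contents are made up for illustration; a real MySQL server stores this on disk and is queried with SQL rather than held in an R list.

```r
# Hypothetical table "genes": columns play the role of fields, rows of records.
genes <- data.frame(
  name  = c("geneA", "geneB", "geneC"),
  chrom = c("chr1", "chr1", "chr2"),
  start = c(100L, 2500L, 40L),
  stringsAsFactors = FALSE
)

# A "database" is then a named collection of tables.
database <- list(genes = genes)

names(database)        # tables in the database
names(database$genes)  # fields of the table
nrow(database$genes)   # number of records
```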


![title](./MySQL.png)


<br>


### Installation

1. Install MySQL
Install `mysql-server` and `mysql-client`.

2. Install RMySQL


http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf


http://www.pants.org/software/mysql/mysqlcommands.html
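
For step 2, RMySQL is normally installed from CRAN with `install.packages()`. The sketch below wraps that in a small check so the package is only installed when it is missing. `ensure_package()` is a hypothetical helper name introduced here, and depending on the platform the MySQL client libraries may need to be present on the system first.

```r
# ensure_package() is a hypothetical helper name, not part of any package.
ensure_package <- function(pkg, installer = utils::install.packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(TRUE))   # already installed, nothing to do
  }
  installer(pkg)              # install from the configured repository
  invisible(FALSE)            # signals that an install was attempted
}
```

Calling `ensure_package("RMySQL")` before `library(RMySQL)` would then install the package only on the first run.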



In [2]:
library(RMySQL)

Loading required package: DBI


### UCSC database
A very well-known genomics database that also runs a MySQL server.

http://genome.ucsc.edu/goldenPath/help/mysql.html


### 1. Connecting And Listing


In [3]:
ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
    # Open a connection; the returned object is the connection handle
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);
    # Send a MySQL query over the open connection
    # Then always disconnect!
    # The TRUE that gets printed is the return value of dbDisconnect()

In [4]:
result

Database
information_schema
ailMel1
allMis1
anoCar1
anoCar2
anoGam1
apiMel1
apiMel2
aplCal1
aptMan1
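
Because forgetting to disconnect is easy, one option is to wrap the open / query / close sequence in a helper that closes the connection even when the query fails. This is a minimal sketch using base R's `on.exit()`. The names `with_connection`, `open_fn`, `close_fn` and `work_fn` are hypothetical, and the demo uses a stand-in "connection" so that it runs without a database server.

```r
# Open a connection, run some work on it, and always close it again.
# open_fn, close_fn and work_fn are hypothetical names used for this sketch.
with_connection <- function(open_fn, close_fn, work_fn) {
  con <- open_fn()
  on.exit(close_fn(con), add = TRUE)  # runs even if work_fn() raises an error
  work_fn(con)
}

# Stand-in "connection" so the sketch runs without a database server.
events <- character(0)
res <- with_connection(
  open_fn  = function() { events <<- c(events, "open"); "fake-connection" },
  close_fn = function(con) events <<- c(events, "close"),
  work_fn  = function(con) { events <<- c(events, "query"); toupper(con) }
)
print(events)
print(res)
```

With RMySQL, the same helper could presumably be called with `function() dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")` as `open_fn`, `dbDisconnect` as `close_fn`, and `function(con) dbGetQuery(con, "show databases;")` as `work_fn`.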


### 2. Connecting to hg19 and listing tables

In [5]:
hg19 <- dbConnect(MySQL(), user="genome", db="hg19",
                 host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)

In [7]:
allTables[1:5]
# Each table corresponds to a different data set.

### 3. Get dimensions of a specific table

In [8]:
dbListFields(hg19, "affyU133Plus2")

In [9]:
# Count all rows (records) in the table
dbGetQuery(hg19, "select count(*) from affyU133Plus2")

count(*)
58463


### 4. Read From the Table

In [12]:
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)

“Unsigned INTEGER in col 18 imported as numeric”

bin,matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,⋯,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts
585,530,4,0,23,3,41,3,898,-,⋯,5,603,chr1,249250621,14361,15816,5,931442297021,34132278541611,1436114454145991496815795
585,3355,17,0,109,9,67,9,11621,-,⋯,0,3548,chr1,249250621,14381,29483,17,73375711653033601986612011260250747398155163,87165540647818112314841682234325452546280830583133320633173472,1438114454149691507515240155431590316104168531705417232174921791417988182672473629320
585,4156,14,0,83,16,18,2,93,-,⋯,3,4274,chr1,249250621,14399,18745,18,6901032333764515511741277859141514431253,447357467798131190119512011217122312351243128515642423256526173062,143991508915099151311516415540155441554915564155691558015587156281590616857169981704917492
585,4667,9,0,68,21,42,3,5743,-,⋯,48,4834,chr1,249250621,14406,24893,23,993522862449146581491444981210355837598150013362458,0994527397648148298368428511001101610611160117311841540238124412450395141034728,1440620227205792086520889209382095220958209632097121120211342117821276212882129821653224922255122559240592421124835
585,5180,14,0,167,10,38,1,29,-,⋯,0,5399,chr1,249250621,19688,25078,11,1312613006411473583359155,013215914601467147214841489149718565244,1968819819198452114521151211552116621170211772153524923
585,468,5,0,14,0,0,0,0,-,⋯,0,487,chr1,249250621,27542,28029,1,487,0,27542


### 5. Select a specific subset

A single table in a database is often so large that loading all of it into R is impractical.

In that case, use a query to extract only the subset you need.

In [13]:
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)

“Unsigned INTEGER in col 18 imported as numeric”

In [14]:
affyMisSmall <- fetch(query, n=10); dbClearResult(query);

In [15]:
dim(affyMisSmall)

In [16]:
dbDisconnect(hg19)
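
The subset query used above can also be assembled from its parts, which helps when the same kind of range filter is applied to several tables or fields. The sketch below only builds the SQL string and does not contact a server. `build_range_query` is a hypothetical helper name, and the optional `limit` argument (which appends a MySQL `limit` clause) is an addition not used in the original query.

```r
# Build "select * from <table> where <field> between <lower> and <upper>".
# build_range_query() is a hypothetical helper for illustration.
build_range_query <- function(table, field, lower, upper, limit = NULL) {
  # Accept only plain identifiers so odd input cannot change the query.
  stopifnot(
    grepl("^[A-Za-z][A-Za-z0-9_]*$", table),
    grepl("^[A-Za-z][A-Za-z0-9_]*$", field),
    is.numeric(lower), is.numeric(upper)
  )
  sql <- sprintf("select * from %s where %s between %d and %d",
                 table, field, as.integer(lower), as.integer(upper))
  if (!is.null(limit)) {
    sql <- sprintf("%s limit %d", sql, as.integer(limit))  # cap the row count
  }
  sql
}

q1 <- build_range_query("affyU133Plus2", "misMatches", 1, 3)
q2 <- build_range_query("affyU133Plus2", "misMatches", 1, 3, limit = 10)
print(q1)
print(q2)
```

The resulting string can be passed to `dbSendQuery()` or `dbGetQuery()` exactly like the literal query in the cell above.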

# [Reading from Web](http://en.wikipedia.org/wiki/Web-scraping)

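**Webscraping**: programmatically extracting data from the HTML code of websites

- "How Netflix reverse engineered Hollywood"
- Many websites contain information you may want to read programmatically
- In some cases this violates the website's terms of service
- Reading too many pages too quickly can get your IP address blocked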
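
One simple way to avoid reading pages too quickly is to pause between requests. The following is a minimal sketch, not a complete crawler: `polite_fetch` and `fake_fetch` are hypothetical names, the URLs are placeholders, and the download step is passed in as a function so the example runs without network access. A suitable delay depends on the site.

```r
# Fetch several pages one after another, pausing between requests.
# polite_fetch() is a hypothetical helper; fetch_fn does the actual download.
polite_fetch <- function(urls, fetch_fn, delay_sec = 1) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay_sec)  # wait before every request after the first
    pages[[i]] <- fetch_fn(urls[[i]])
  }
  names(pages) <- urls
  pages
}

# Stand-in for a real download function, used only for this demo.
fake_fetch <- function(u) paste0("<html>", u, "</html>")

urls <- c("http://example.com/a", "http://example.com/b", "http://example.com/c")
pages <- polite_fetch(urls, fake_fetch, delay_sec = 0.1)
length(pages)
```

In real use, `fetch_fn` could be something like `function(u) readLines(u)` or a wrapper around `httr::GET`.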

http://www.r-bloggers.com/?s=Web+Scraping

http://cran.r-project.org/web/packages/httr/httr.pdf


### Example : Google Scholar

http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en

### 1. Getting data off webpages - readLines()

In [19]:
con = url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
htmlCode = readLines(con)
close(con)
htmlCode

“incomplete final line found on 'http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'”


In [20]:
head(htmlCode,1)
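
`readLines()` returns a character vector with one element per line of the page, so any further processing is plain string handling. As a small sketch, the lines can be collapsed into one string and a simple pattern pulled out with a regular expression. The HTML below is a made-up stand-in for `htmlCode`, and regular expressions are only reasonable for simple cases like this; for anything structured, a parser as in the next section is the safer choice.

```r
# Made-up stand-in for the character vector returned by readLines().
html_lines <- c(
  "<!doctype html><html><head>",
  "<title>Example Page</title>",
  "</head><body>...</body></html>"
)

# Collapse the lines into a single string.
html_text <- paste(html_lines, collapse = "\n")

# Extract the text between <title> and </title> (Perl-style lookarounds).
page_title <- regmatches(
  html_text,
  regexpr("(?<=<title>).*?(?=</title>)", html_text, perl = TRUE)
)
print(page_title)
```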

### 2. Parsing with XML

As usual, the XML package is prone to errors.

Let's practice using httr's GET instead!


In [26]:
library(XML)
library(httr)

In [29]:
url <- "https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(rawToChar(GET(url)$content), useInternalNodes = TRUE)

xpathSApply(html, "//title", xmlValue)

In [32]:
xpathSApply(html, "//td[@id='col-citedby']", xmlValue)
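
Since the layout of a live page can change at any time, it can help to try XPath expressions on a small inline document first. The sketch below assumes the XML package is installed and uses a made-up HTML fragment; the class names `paper` and `cites` are hypothetical and are not taken from Google Scholar.

```r
library(XML)

# Hypothetical page fragment standing in for a downloaded HTML document.
sample_html <- paste0(
  "<html><head><title>Demo page</title></head><body>",
  "<table>",
  "<tr><td class='paper'>Paper A</td><td class='cites'>10</td></tr>",
  "<tr><td class='paper'>Paper B</td><td class='cites'>25</td></tr>",
  "</table></body></html>"
)

# Parse the string (asText = TRUE means "this is content, not a file name").
doc <- htmlParse(sample_html, asText = TRUE)

# "//title" matches every <title> node; xmlValue extracts its text.
demo_title <- xpathSApply(doc, "//title", xmlValue)

# "//td[@class='cites']" matches <td> nodes whose class attribute is "cites".
cites <- as.integer(xpathSApply(doc, "//td[@class='cites']", xmlValue))

print(demo_title)
print(cites)
```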

##### GET from the httr package

In [33]:
html2 = GET(url)
html2

Response [https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en]
  Date: 2018-03-12 12:46
  Status: 200
  Content-Type: text/html; charset=ISO-8859-1
  Size: 130 kB
<!doctype html><html><head><title>Jeff Leek - Google Scholar Citations</title...
2]=arguments[e];return b.prototype[c].apply(a,d)}};var ia=function(a){var b=[...
document.documentElement.classList||function(){function a(a){return(a=(a=a.cl...
qa=ra?0<+ra[1]:r("Android")?!0:window.matchMedia&&window.matchMedia("(pointer...
c.type="hidden",c.name=b,a.appendChild(c));return c},ya=function(a){t("gsc_md...
var Ca=function(a){var b=a.b,c=b.length;a=a.m;for(var d=0,e=0;e<c;e++){var g=...
a},Ea=function(a,b,c,d,e){a.addEventListener(b,c,La(d,e))},Ga=function(a,b,c,...
var Ra=function(){Ha(["mousedown","touchstart"],function(){q(document.documen...
a;){var c=b||D.length>a+1;D.pop()(!!c)}},Va=function(a){for(var b=0;a&&!(b=C[...
y(document,"focus",function(a){var b=D.length;if(b)for(var c=Va(a.target);c<b...
...

In [34]:
content2 = content(html2, as="text")
parsedHtml = htmlParse(content2, asText=TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)

##### Accessing websites with passwords

In [36]:
pg1 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
pg1

Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2018-03-12 12:48
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true, 
  "user": "user"
}

In [37]:
names(pg1)

##### Using Handles

In [38]:
# A handle keeps cookies and settings across requests to the same host
google = handle("http://google.com")
pg1 = GET(handle=google, path="/")
pg2 = GET(handle=google, path="search")