<a href="https://colab.research.google.com/github/saulobritto/bioinfo-tools-gcolab/blob/main/SCE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Search and Count Engine (SCE)

---

Author:

* Saulo Britto da Silva
 * saulobdasilva@gmail.com
 * https://github.com/saulobritto

Last update: 08 Jan. 2021


>Program that searches for a word/term/gene in multiple files across multiple subfolders, reports how many times the term appears in each file, and generates a CSV of the results, which can be downloaded.

---



##Results to CSV



In [None]:
#Run this cell to mount the drive (for Colab to have access to My Drive)
from google.colab import drive
drive.mount("/content/drive")

###Source Code


---
**Run all cells in this section**

You can do this just by clicking the run button above, without having to open the hidden cells


---



In [None]:
#importing necessary packages
import os
import pathlib
import pandas as pd
import numpy as np

In [None]:
#main code
def sce(palavra, onde, tipo, saida):
  #recursively collect every file under `onde` whose name ends with `tipo`
  files = [f for f in pathlib.Path(onde).glob("**/*"+tipo) if f.is_file()]
  z = ['File', 'Word', 'Count']
  x = []
  for filename in files:
    with open(filename, 'r') as f:
      data = f.read()
    #file name without its extension, the searched term, and its number of occurrences
    x.append([filename.stem, palavra, data.count(palavra)])
  #creating table (built from the list directly so that the counts stay numeric)
  n = pd.DataFrame(x, columns = z)
  #generating CSV
  n.to_csv(saida+'.csv')
  print(n)
  #info() prints its summary itself and returns None, so it is not wrapped in print()
  n.info()
  #plotting the number of files per count
  n.groupby('Count')['Count'].count().plot(kind='bar', title='Number of files per count', legend=True, xlabel='Count', ylabel='Number of files')
  

###Using SCE


---
**sce(word, where, file-type, output)**

* word = word to be searched

* where = path of the folder whose subfolders contain the files

* file-type = ending of the file names to be searched, usually the extension (e.g. '.txt')

* output = output file name
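
The way `where` and `file-type` select files can be sketched as follows. This is a minimal illustration, assuming the same `"**/*" + file-type` glob pattern used in the source code; the helper `matching_files` and the folder and file names are hypothetical and exist only for this example.

```python
import pathlib
import tempfile


def matching_files(where, file_type):
    """Return the files under `where` (at any depth) whose names end with `file_type`."""
    return sorted(p for p in pathlib.Path(where).glob("**/*" + file_type) if p.is_file())


# Build a small throwaway folder tree (hypothetical names) to try the pattern on
with tempfile.TemporaryDirectory() as root:
    base = pathlib.Path(root)
    (base / "sample_a").mkdir()
    (base / "sample_b" / "nested").mkdir(parents=True)
    (base / "sample_a" / "genes.txt").write_text("BRCA1 TP53")
    (base / "sample_b" / "nested" / "genes.txt").write_text("BRCA1")
    (base / "sample_b" / "notes.md").write_text("BRCA1")

    # Only the .txt files are selected, however deeply they are nested
    found = [p.relative_to(base).as_posix() for p in matching_files(base, ".txt")]

print(found)
```

Because the pattern only checks how the file name ends, passing `'.txt'` rather than `'txt'` avoids also matching names that merely end in those letters.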



In [None]:
sce('word/gene', 'folder-containing-subfolders-path', 'file-type', 'output-file-name')

###CSV Download


---
Insert the 'output-file-name.csv' below


In [None]:
from google.colab import files

#Insert the 'output-file-name.csv' below
files.download("output-file-name.csv")
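
Once downloaded, the CSV can be loaded again for further filtering. The sketch below is only an illustration under stated assumptions: the table, its column names (`file`, `term`, `count`), and the file name are placeholders standing in for the real output, which is written with `to_csv` in the same way (row index included).

```python
import os
import tempfile

import pandas as pd

# Placeholder table standing in for the results; names and values are hypothetical
table = pd.DataFrame(
    [["sample_a", "TP53", 3], ["sample_b", "TP53", 0]],
    columns=["file", "term", "count"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output-file-name.csv")
    # to_csv() without index=False also writes the row index as the first column
    table.to_csv(path)
    # index_col=0 reads that first column back as the index instead of as data
    loaded = pd.read_csv(path, index_col=0)

# Keep only the files in which the term was found at least once
files_with_hits = loaded[loaded["count"] > 0]
print(files_with_hits)
```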


##Results to Google Sheets



In [None]:
#Run this cell to mount the drive (for Colab to have access to My Drive)
from google.colab import drive
drive.mount("/content/drive")

In [None]:
#Install the gspread package, a third-party Python client for the Google Sheets API, used here to connect Colab to Google Sheets
!pip install --upgrade gspread

In [None]:
#Authenticate this Colab session and obtain the credentials gspread needs to read and write your spreadsheets
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default

#google.auth replaces the deprecated oauth2client package
creds, _ = default()
gc = gspread.authorize(creds)

In [None]:
#Create a spreadsheet, enter the name below
planilha = gc.create('spreadsheet-name')

In [None]:
#Open the spreadsheet to edit
worksheet = gc.open('spreadsheet-name').sheet1

###Source Code


---
**Run all cells in this section**

You can do this just by clicking the run button above, without having to open the hidden cells


---


In [None]:
#Importing necessary packages
import os
import pathlib
import pandas as pd
import numpy as np

In [None]:
#Main code
def sce(palavra, onde, tipo):
  #Recursively collect every file under `onde` whose name ends with `tipo`
  files = [f for f in pathlib.Path(onde).glob("**/*"+tipo) if f.is_file()]
  z = ['File', 'Word', 'Count']
  x = []
  for filename in files:
    with open(filename, 'r') as f:
      data = f.read()
    #File name without its extension, the searched term, and its number of occurrences
    x.append([filename.stem, palavra, data.count(palavra)])
  #Write the results to the spreadsheet, with the column names as a frozen first row
  worksheet.clear()
  worksheet.update(x)
  worksheet.insert_row(z, index=1)
  worksheet.freeze(rows=1)
  #Read the sheet back to confirm what was written
  rows = worksheet.get_all_values()
  rows = pd.DataFrame.from_records(rows)
  print(rows)
  #info() prints its summary itself and returns None, so it is not wrapped in print()
  rows.info()
  #Build the table from the list directly so that the counts stay numeric
  n = pd.DataFrame(x, columns = z)
  #Keep a local CSV copy of the results
  n.to_csv('cache.csv')
  #Plotting the number of files per count
  n.groupby('Count')['Count'].count().plot(kind='bar', title='Number of files per count', legend=True, xlabel='Count', ylabel='Number of files')


###Using SCE


---
**sce(word, where, type)**

* word = word to be searched

* where = path of the folder whose subfolders contain the files

* type = ending of the file names to be searched, usually the extension (e.g. '.txt')
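
The count for each file comes from Python's `str.count`, so it helps to know what that method considers a match. The short sketch below uses a made-up sentence and gene names purely for illustration: matching is case-sensitive, it also counts the term inside longer words, and overlapping occurrences are not counted twice.

```python
# Made-up text for illustration only
text = "TP53 interacts with tp53; TP53BP1 binds TP53."

# Case-sensitive substring count: "tp53" is skipped, but the "TP53" inside "TP53BP1" is counted
exact = text.count("TP53")

# Lower-casing both the text and the term makes the count case-insensitive
ignore_case = text.lower().count("tp53")

# Occurrences are non-overlapping: "AAAA" contains "AA" twice, not three times
non_overlapping = "AAAA".count("AA")

print(exact, ignore_case, non_overlapping)
```

If whole-word matches are needed, for example to tell a gene apart from longer names that contain it, a regular expression with word boundaries would be one alternative, at the cost of changing the counting rule used here.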



In [None]:
sce('word/term/gene', 'where', 'type')

###Sharing the spreadsheet


---
Just enter the email(s) or delete the line



In [None]:
#Share the generated spreadsheet with other users, giving each one specific permissions
#Use one line per email: copy and paste a line to add a user, or delete the lines you do not need

planilha.share('EMAIL1@gmail.com', perm_type='user', role='writer')
planilha.share('EMAIL2@gmail.com', perm_type='user', role='writer')
planilha.share('EMAIL3@gmail.com', perm_type='user', role='writer')
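
When the list of addresses is long, the repeated lines above can be replaced by a loop. The following is a minimal sketch, not part of the original notebook: `share_with` is a hypothetical helper, and `FakeSpreadsheet` is a stand-in that only records calls so the example runs without Google credentials. It assumes the spreadsheet object exposes the same `share(email, perm_type=..., role=...)` call used above.

```python
def share_with(spreadsheet, emails, role="writer"):
    """Share `spreadsheet` with every address in `emails` and return the addresses processed."""
    shared = []
    for email in emails:
        # Same call as in the cell above, issued once per address
        spreadsheet.share(email, perm_type="user", role=role)
        shared.append(email)
    return shared


class FakeSpreadsheet:
    """Stand-in that records share() calls instead of contacting Google."""

    def __init__(self):
        self.calls = []

    def share(self, email, perm_type, role):
        self.calls.append((email, perm_type, role))


fake = FakeSpreadsheet()
done = share_with(fake, ["EMAIL1@gmail.com", "EMAIL2@gmail.com"])
print(fake.calls)
```

In the notebook itself, the call would be `share_with(planilha, [...])` with the real addresses in the list.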
