This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

[QUESTION] Elasticsearch feeding #150

Closed
wcypierre opened this issue Jun 10, 2018 · 26 comments

Comments

@wcypierre

Make sure you've checked the following:

  • [Y] Python version is 3.5 or higher.
  • [Y] Using the latest version of Twint.
  • [Y] I have searched the issues and there are no duplicates of this issue/question/request.

Description of Issue

I've been trying to get the dashboard working, but I'm stuck at the index creation step in https://github.com/haccer/twint/wiki/Elasticsearch
I ran "python3 Twint.py -es 192.168.88.17:9200 -u inter --since 2018-06-09" and got this error:
[screenshot]

I then tried passing a random essid and the sync went through, but I got stuck on the next step: when I pasted index-tweets.json into Dev Tools, I got resource_already_exists_exception
[screenshot: index already exists]

Any clue what went wrong in my steps?

Environment Details

OS: Ubuntu 18.04
Python: 3.6.5
Elasticsearch Version: 6.2.4
Twint: pulled latest from master branch

@wcypierre
Author

This is what I get in the twint index if I pass the custom essid

{
  "twint": {
    "aliases": {},
    "mappings": {
      "items": {
        "properties": {
          "date": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "day": {
            "type": "long"
          },
          "essid": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "hour": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "likes": {
            "type": "boolean"
          },
          "link": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "location": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "replies": {
            "type": "boolean"
          },
          "retweets": {
            "type": "boolean"
          },
          "timezone": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "tweet": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "user_id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "user_rt": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "username": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1528635034991",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "QGha27oFSyWrzj-WyzD4cA",
        "version": {
          "created": "6020499"
        },
        "provided_name": "twint"
      }
    }
  }
}

@pielco11 pielco11 added the bug label Jun 10, 2018
@pielco11
Member

This happens because config.Essid was not set, so its value is None rather than a str (my bad)

We reverted the commit because of an issue with the databases; now it should work as expected

(I did not test your index, so I don't know whether it could cause errors)
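
To illustrate the failure mode described above: if a document ID is built by string concatenation and the session ID is unset, Python raises a TypeError. This is only a minimal sketch. The helper names and the "<id>_raw_<essid>" layout are assumptions made for illustration, not Twint's actual code.

```python
from typing import Optional


def naive_doc_id(tweet_id, essid):
    # Plain concatenation: fails with TypeError when essid is None.
    return str(tweet_id) + "_raw_" + essid


def make_doc_id(tweet_id: int, essid: Optional[str]) -> str:
    """Build a document ID while tolerating an unset session ID.

    Hypothetical helper: the ID layout is an assumption for this sketch.
    """
    # Fall back to an empty string so that None never reaches the "+".
    return str(tweet_id) + "_raw_" + (essid or "")


try:
    naive_doc_id(42, None)
except TypeError as exc:
    print("naive version failed:", type(exc).__name__)

print(make_doc_id(42, None))
print(make_doc_id(42, "abc1"))
```

Passing any non-empty string as the essid sidesteps the problem as well, which matches the workaround reported earlier in this thread.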

@pielco11
Member

pielco11 commented Jun 10, 2018

The error that you see in Dev Tools occurs because an index with that name already exists. To delete an index, run DELETE *index-name*, where *index-name* is the name of the index that you want to delete

You can choose any names for your indexes; in that case you'll have to edit twint/elasticsearch.py and change _index: in the JSON objects
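
For anyone who prefers a script over Dev Tools, the same DELETE can be prepared with the Python standard library. This is a sketch under assumptions: build_delete_index_request is a hypothetical helper, the host is the one from the original report, and the request is only built here, not sent.

```python
import urllib.request


def build_delete_index_request(host: str, index: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP DELETE for a single index."""
    url = "http://{}/{}".format(host, index)
    return urllib.request.Request(url, method="DELETE")


req = build_delete_index_request("192.168.88.17:9200", "twint")
print(req.get_method(), req.full_url)
# To really delete the index, send the request:
#   urllib.request.urlopen(req)
```

Deleting an index removes its documents too, so anything already indexed has to be scraped again afterwards.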

@wcypierre
Author

What value should I put in config.Essid?

@wcypierre
Author

I have tried that as well. I deleted the twint index, did step 2 first (creating the twint index from the index-tweets.json file) and then ran step 1, but to no avail

@pielco11
Member

For config.Essid choose whatever you prefer, letters and/or numbers

Can you elaborate a bit on what you did, please?

@wcypierre
Author

In short, I've done the two steps in different orders, but I don't see the output that I'm supposed to see in https://github.com/haccer/twint/wiki/Elasticsearch

A. Let Twint generate the index (original step in the wiki)
Ran "python3 Twint.py -es 192.168.88.17:9200 -u inter --since 2018-06-09" and it generated the twint index. However, in Create Index Pattern, I can't see any dropdown value for the Time filter (presumably because date is of type text instead of type date, unlike the one in index-tweets.json)

I do see values in Discover, but they don't look like what is in the wiki

B. Pre-create the twint index using index-tweets.json
Ran "python3 Twint.py -es 192.168.88.17:9200 -u inter --since 2018-06-09" and I can generate the index and see the date value in the Time filter dropdown, but I see nothing in Discover
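
Kibana only offers fields mapped as type date in the Time filter dropdown, so one quick check is to list the date-typed fields of the index. The sketch below assumes the Elasticsearch 6.x mapping layout shown earlier in this thread (document type "items"); date_fields is a hypothetical helper and the mapping is a trimmed copy of the one posted above.

```python
def date_fields(index_info, index, doc_type="items"):
    """Return the names of all fields mapped as type "date"."""
    properties = index_info[index]["mappings"][doc_type]["properties"]
    return sorted(
        name for name, spec in properties.items() if spec.get("type") == "date"
    )


# Trimmed copy of the auto-generated mapping shown earlier in this thread.
auto_generated = {
    "twint": {
        "mappings": {
            "items": {
                "properties": {
                    "date": {"type": "text"},
                    "day": {"type": "long"},
                }
            }
        }
    }
}

print(date_fields(auto_generated, "twint"))  # [] -> no candidate time field
```

An empty list here is consistent with case A: the index exists, but nothing in it can serve as the time field.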

@pielco11
Member

Very clear, will look at it asap

@wcypierre
Author

I am guessing it is because of the metadata difference between the one in index-tweets.json and the one that is generated by twint #150 (comment)
but I'm new to elasticsearch so I'm not sure of how it works in terms of data mapping to the schema
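
One way to test that guess is to compare the field types of the two mappings directly. This is a rough sketch: type_mismatches is a hypothetical helper, and the expected mapping below contains only the date field, because that is the only difference reported in this thread. The real index-tweets.json may differ in other fields.

```python
def type_mismatches(actual, expected):
    """Map field name -> (actual type, expected type) where the two differ."""
    mismatches = {}
    for name, spec in expected.items():
        got = actual.get(name, {}).get("type")  # None if the field is missing
        want = spec.get("type")
        if got != want:
            mismatches[name] = (got, want)
    return mismatches


# "properties" of the auto-generated index (trimmed from the mapping above).
actual = {"date": {"type": "text"}, "day": {"type": "long"}}
# Assumed excerpt of the intended mapping; only "date" is known from the thread.
expected = {"date": {"type": "date"}}

print(type_mismatches(actual, expected))
```

Any field reported here was mapped by Elasticsearch's automatic type detection rather than by the intended mapping.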

@pielco11
Member

Hi,

thanks for waiting.
The main issue is that Elasticsearch automatically creates a new index if the index you are indexing into does not exist, and that auto-created index differs a bit from the one I chose (mine is not a rule, just what I'm suggesting).

You can create a new index as explained in the Wiki, changing the index name (PUT customindex ...), before indexing with Twint into the new customindex index.
I'll edit the code so that this step is automated.

You can also index before creating the index in Elasticsearch, but then you'll have to run DELETE customindex in Dev Tools, then PUT customindex ..., and after that update the index pattern
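
As a rough scripted equivalent of PUT customindex ..., the request can be prepared with the standard library. Everything here is an assumption for illustration: build_create_index_request is a hypothetical helper, the "items" document type is taken from the mapping posted earlier in this thread, and the one-field mapping is a placeholder for the full body in index-tweets.json. The request is built but not sent.

```python
import json
import urllib.request


def build_create_index_request(host, index, properties, doc_type="items"):
    """Prepare (but do not send) a PUT that creates an index with a mapping."""
    body = {"mappings": {doc_type: {"properties": properties}}}
    return urllib.request.Request(
        "http://{}/{}".format(host, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder mapping: use the full field list from index-tweets.json instead.
req = build_create_index_request(
    "localhost:9200", "customindex", {"date": {"type": "date"}}
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To create the index for real: urllib.request.urlopen(req)
```

Creating the index explicitly before the first document arrives is what keeps Elasticsearch from guessing the field types on its own.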

PS: now config.Essid can be None

@pielco11
Member

Closing since I no longer get the error; if it comes back, feel free to reopen and/or comment

@ghost

ghost commented Nov 28, 2018

Hello,

I'm having the same problem when I try to visualize the documents: the dashboard I get is not the same as the one in https://github.com/twintproject/twint/wiki/Elasticsearch. I mean the graph that appears at the top of the page in the Discover option. What's wrong?

Thank you!

@pielco11
Member

Hello @alannaparker45 ,

did you choose the time field before creating the index? Here is an example of what you should see, not related to Twint.

@ghost

ghost commented Dec 3, 2018

I didn't have the option to put @timestamp. My options are the ones I attach.
[screenshot]

pielco11 added a commit that referenced this issue Dec 3, 2018
@pielco11
Member

pielco11 commented Dec 3, 2018

@alannaparker45 that was a typo and now it's fixed

@pielco11
Member

pielco11 commented Dec 3, 2018

Please note that the twintuser index is created as well, because Twint also scrapes information about the "target user", the one that you specify with the -u/--username argument

PLUS:
the error was caused by a wrong field type, geopoint instead of geo_point: Elasticsearch did not recognize that type, so it created an index different from the one that we want. We use custom date fields, and Elasticsearch then recognized them as text rather than as the dates they are meant to be
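
A typo like this can be caught before the mapping is sent by checking each declared type against a list of known Elasticsearch field types. This is just a sketch: unknown_field_types is a hypothetical helper, KNOWN_TYPES is a partial list of common types rather than the full set, and the field name "geo" is made up for the example.

```python
# Partial list of Elasticsearch field types; extend it as needed.
KNOWN_TYPES = {
    "text", "keyword", "long", "integer", "short", "byte", "double", "float",
    "date", "boolean", "binary", "object", "nested", "geo_point", "geo_shape",
    "ip",
}


def unknown_field_types(properties):
    """Return {field name: type} for every type not found in KNOWN_TYPES."""
    return {
        name: spec["type"]
        for name, spec in properties.items()
        if "type" in spec and spec["type"] not in KNOWN_TYPES
    }


# Hypothetical mapping excerpt containing the typo discussed above.
properties = {
    "date": {"type": "date"},
    "geo": {"type": "geopoint"},  # should be "geo_point"
}

print(unknown_field_types(properties))
```

Fields without a "type" key (plain objects with nested properties) are skipped on purpose.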

@ghost

ghost commented Dec 3, 2018

@pielco11 thanks,
I fixed the first issue related to the geo_point typo, but it's still not working. I can't find the file to edit the part of the code where "someindex3" is located.

@pielco11
Member

pielco11 commented Dec 3, 2018

@alannaparker45 that is just a script that I wrote to do some testing

I suggest you go to the Dev Tools panel and delete the old indexes, then re-run the code that you were trying to play with

So for example, if you were trying to scrape tweets posted by google, you now have to run code similar to this:

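import twint

# Configure the search: scrape tweets from the "google" account
c = twint.Config()
c.Username = "google"
# Send the results to the Elasticsearch instance at this host:port
c.Elasticsearch = "localhost:9200"

# Run the search; the scraped tweets are indexed into Elasticsearch
twint.run.Search(c)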

PLUS:
if you want to use a specific name for the tweets index, set c.Index_tweets = "mycustomname" if you are using Twint as a module, or pass -it/--index-tweets mycustomname if you are using Twint via the CLI ("mycustomname" is whatever name you want to give the index)
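
For scripted runs via the CLI, the argument list can be assembled like this. It is a small sketch, assuming the python3 Twint.py entry point used earlier in this thread; build_twint_command is a hypothetical helper, and only the -u, -es and -it flags mentioned in this thread are used.

```python
def build_twint_command(username, es_host, tweets_index=None):
    """Assemble the argument list for a Twint CLI run (nothing is executed)."""
    cmd = ["python3", "Twint.py", "-u", username, "-es", es_host]
    if tweets_index is not None:
        # Optional custom name for the tweets index.
        cmd += ["-it", tweets_index]
    return cmd


print(" ".join(build_twint_command("google", "localhost:9200", "mycustomname")))
# To run it: subprocess.run(build_twint_command(...), check=True)
```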

@ghost

ghost commented Dec 4, 2018

Thanks for your fast response. Yesterday I uninstalled and reinstalled Kibana and Elasticsearch, just in case something was wrong with the code. When I tried to create a new index, I had the same problem. Before trying to create the new index, I changed geopoint to geo_point.

@pielco11
Member

pielco11 commented Dec 4, 2018

@alannaparker45 so I guess that with the new setup and the fix everything is working fine; if not, just say so

@ghost

ghost commented Dec 4, 2018

The options for the index are the same as the other day, and for the index "twinttweets" there aren't any options. "date" or "timestamp" should appear, shouldn't it?

@pielco11
Member

pielco11 commented Dec 4, 2018

@alannaparker45 the first screenshot is about the index for users, twintuser, the second one is the index for tweets, twinttweets, and in your case it seems that this index does not have a date/time field

So now what you have to do is:

Delete the tweets index

[screenshot]

Delete the users index

[screenshot]

Create a new one by indexing some tweets

[screenshot]

In the screenshot above, t.py is just a script that I wrote; all it does is index some tweets from an account. This script is custom and not provided in the Twint repo, it's just an example.

Create the index pattern for tweets

[screenshot]

Create the index pattern for users

(here you can choose between join_date, join_datetime and join_time; I suggest choosing join_date, but feel free to use whatever is best for your use case)

[screenshot]

FAQ

Q: But I do not see that Twint created an index: [+] Index "twintuser" created! and/or [+] Index "twinttweets" created! are not printed. Why?

A: Yesterday I pushed a fix to get the correct field types, so you have to update your local repo/package as well.

Run pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint to update your local package

@ghost

ghost commented Dec 5, 2018

Now it's working! The thing was that I had to upgrade the code and then have the module create a new index!

Thanks!

@murchie85

Hi guys, when I try to run Twint as shown in the instructions I get a 400 Bad Request. Any ideas?

@pielco11
Member

pielco11 commented Jun 26, 2019 via email

@murchie85

Thank you, that fixed the issue!

pull bot pushed a commit to security-geeks/twint that referenced this issue Oct 20, 2019